jarxiv | Japanese arxiv | Page 133

Rethinking Range-View LiDAR Segmentation in Adverse Weather

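Posted: June 11, 2025 | Author: jarxiv

Summary

LiDAR segmentation has emerged as an important task for enriching multimedia experiences and analysis.
Range-view-based methods have become popular because they are computationally efficient and compatible with real-time deployment.
However, how well they generalize under adverse weather remains underexplored, which limits their reliability in real-world environments.
In this work, we identify and analyze the distinct challenges that affect the generalization of range-view LiDAR segmentation in severe weather.
To address these challenges, we propose a modular, lightweight framework that improves robustness without altering the core architecture of existing models.
Our method reformulates the initial stem block of a standard range-view network into two branches that process geometric attributes and reflectance intensity separately.
Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization.
The processed features are then fused and passed to the original segmentation pipeline.
Extensive experiments on several benchmarks and baseline models show that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.

Summary (original)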

LiDAR segmentation has emerged as an important task to enrich multimedia experiences and analysis. Range-view-based methods have gained popularity due to their high computational efficiency and compatibility with real-time deployment. However, their generalized performance under adverse weather conditions remains underexplored, limiting their reliability in real-world environments. In this work, we identify and analyze the unique challenges that affect the generalization of range-view LiDAR segmentation in severe weather. To address these challenges, we propose a modular and lightweight framework that enhances robustness without altering the core architecture of existing models. Our method reformulates the initial stem block of standard range-view networks into two branches to process geometric attributes and reflectance intensity separately. Specifically, a Geometric Abnormality Suppression (GAS) module reduces the influence of weather-induced spatial noise, and a Reflectance Distortion Calibration (RDC) module corrects reflectance distortions through memory-guided adaptive instance normalization. The processed features are then fused and passed to the original segmentation pipeline. Extensive experiments on different benchmarks and baseline models demonstrate that our approach significantly improves generalization to adverse weather with minimal inference overhead, offering a practical and effective solution for real-world LiDAR segmentation.
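
The RDC module is described as correcting reflectance distortions through memory-guided adaptive instance normalization. As a rough illustration of the normalization step only, the following Python sketch applies plain adaptive instance normalization to one channel of reflectance values. The function name `adain`, the sample values, and the target statistics are assumptions made for illustration; how a memory would supply those statistics is not specified in the abstract and is not modeled here.

```python
import math


def adain(channel, target_mean, target_std, eps=1e-8):
    """Re-normalize one feature channel to the given target statistics.

    Plain adaptive instance normalization: standardize the channel with its
    own mean and standard deviation, then rescale to the target statistics.
    """
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var + eps)
    return [target_std * (x - mean) / std + target_mean for x in channel]


# Hypothetical reflectance values, compressed and shifted by bad weather.
distorted = [0.10, 0.12, 0.11, 0.13]

# Hypothetical clean-weather statistics, standing in for whatever a memory
# bank might provide in the actual method.
calibrated = adain(distorted, target_mean=0.40, target_std=0.20)
```

After the call, `calibrated` keeps the relative ordering of the input values but carries the target mean and spread.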

arXiv information

Authors	Longyu Yang, Ping Hu, Lu Zhang, Jun Liu, Yap-Peng Tan, Heng Tao Shen, Xiaofeng Zhu
Published	2025-06-10 16:48:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.RO

Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models

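Posted: June 11, 2025 | Author: jarxiv

Summary

Medical vision-language alignment through cross-modal contrastive learning has shown promising performance on image-text matching tasks such as retrieval and zero-shot classification.
However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which limits their effectiveness in vision-language alignment.
In contrast, models pretrained via multimodal masked modeling struggle with direct cross-modal matching but excel at visual representation.
To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that uses only about 8% of the trainable parameters and less than 1/5 of the computation required for masked record modeling.
ALTA achieves superior performance on vision-language matching tasks such as retrieval and zero-shot classification by adapting a vision model pretrained with masked record modeling.
In addition, we integrate temporal-multiview radiograph inputs to strengthen the information consistency between radiographs and their corresponding descriptions in reports, further improving vision-language alignment.
Experimental evaluations show that ALTA outperforms the best-performing counterpart by more than 4 absolute percentage points in text-to-image accuracy and by approximately 6 absolute percentage points in image-to-text retrieval accuracy.
Adapting vision-language models during efficient alignment also promotes better vision and language understanding.
Code is publicly available at https://github.com/DopamineLcy/ALTA.

Summary (original)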

Medical vision-language alignment through cross-modal contrastive learning shows promising performance in image-text matching tasks, such as retrieval and zero-shot classification. However, conventional cross-modal contrastive learning (CLIP-based) methods suffer from suboptimal visual representation capabilities, which also limits their effectiveness in vision-language alignment. In contrast, although the models pretrained via multimodal masked modeling struggle with direct cross-modal matching, they excel in visual representation. To address this contradiction, we propose ALTA (ALign Through Adapting), an efficient medical vision-language alignment method that utilizes only about 8% of the trainable parameters and less than 1/5 of the computational consumption required for masked record modeling. ALTA achieves superior performance in vision-language matching tasks like retrieval and zero-shot classification by adapting the pretrained vision model from masked record modeling. Additionally, we integrate temporal-multiview radiograph inputs to enhance the information consistency between radiographs and their corresponding descriptions in reports, further improving the vision-language alignment. Experimental evaluations show that ALTA outperforms the best-performing counterpart by over 4% absolute points in text-to-image accuracy and approximately 6% absolute points in image-to-text retrieval accuracy. The adaptation of vision-language models during efficient alignment also promotes better vision and language understanding. Code is publicly available at https://github.com/DopamineLcy/ALTA.
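
The CLIP-based cross-modal contrastive learning that this abstract contrasts with masked modeling can be sketched as a symmetric InfoNCE loss over matched image-text pairs. The Python below is a generic illustration, not ALTA's training objective; the function name `clip_contrastive_loss`, the temperature of 0.07, and the toy embeddings are assumptions.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[k]
    return total / len(logits)


def clip_contrastive_loss(image_emb, text_emb, tau=0.07):
    """Symmetric contrastive loss; pair i is (image_emb[i], text_emb[i])."""
    images = [_normalize(v) for v in image_emb]
    texts = [_normalize(v) for v in text_emb]
    # Cosine similarity of every image with every text, scaled by 1/tau.
    logits = [[sum(a * b for a, b in zip(i, t)) / tau for t in texts]
              for i in images]
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = _cross_entropy_diagonal(logits)
    text_to_image = _cross_entropy_diagonal(transposed)
    return 0.5 * (image_to_text + text_to_image)


# Toy embeddings: matched pairs point the same way, mismatched pairs do not.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each image is most similar to its own report and grows when the pairing is wrong, which is the signal that retrieval and zero-shot classification rely on.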

arXiv information

Authors	Chenyu Lian, Hong-Yu Zhou, Dongyun Liang, Jing Qin, Liansheng Wang
Published	2025-06-10 17:02:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CV, cs.LG

Do Concept Replacement Techniques Really Erase Unacceptable Concepts?

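Posted: June 11, 2025 | Author: jarxiv

Summary

Generative models, particularly diffusion-based text-to-image (T2I) models, have demonstrated astounding success.
However, aligning them to avoid generating content with unacceptable concepts (e.g., offensive or copyrighted content, or celebrity likenesses) remains a significant challenge.
Concept replacement techniques (CRTs) aim to address this challenge, often by trying to "erase" unacceptable concepts from models.
Recently, model providers have started offering image editing services that accept an image and a text prompt as input and produce an image altered as the prompt specifies.
These are known as image-to-image (I2I) models.
In this paper, we first use an I2I model to demonstrate empirically that today's state-of-the-art CRTs do not in fact erase unacceptable concepts.
Existing CRTs are therefore likely to be ineffective in emerging I2I scenarios despite their proven ability to remove unwanted concepts in T2I pipelines, which highlights the need to understand this discrepancy between the T2I and I2I settings.
Next, we argue that a good CRT, while replacing unacceptable concepts, should preserve the other concepts specified in the inputs to generative models.
We call this fidelity.
Prior work on CRTs has neglected fidelity in the case of unacceptable concepts.
Finally, we propose using targeted image-editing techniques to achieve both effectiveness and fidelity.
We present such a technique, AntiMirror, and demonstrate its viability.

Summary (original)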

Generative models, particularly diffusion-based text-to-image (T2I) models, have demonstrated astounding success. However, aligning them to avoid generating content with unacceptable concepts (e.g., offensive or copyrighted content, or celebrity likenesses) remains a significant challenge. Concept replacement techniques (CRTs) aim to address this challenge, often by trying to ‘erase’ unacceptable concepts from models. Recently, model providers have started offering image editing services which accept an image and a text prompt as input, to produce an image altered as specified by the prompt. These are known as image-to-image (I2I) models. In this paper, we first use an I2I model to empirically demonstrate that today’s state-of-the-art CRTs do not in fact erase unacceptable concepts. Existing CRTs are thus likely to be ineffective in emerging I2I scenarios, despite their proven ability to remove unwanted concepts in T2I pipelines, highlighting the need to understand this discrepancy between T2I and I2I settings. Next, we argue that a good CRT, while replacing unacceptable concepts, should preserve other concepts specified in the inputs to generative models. We call this fidelity. Prior work on CRTs have neglected fidelity in the case of unacceptable concepts. Finally, we propose the use of targeted image-editing techniques to achieve both effectiveness and fidelity. We present such a technique, AntiMirror, and demonstrate its viability.

arXiv information

Authors	Anudeep Das, Gurjot Singh, Prach Chantasantitam, N. Asokan
Published	2025-06-10 17:02:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CR, cs.CV

SDTagNet: Leveraging Text-Annotated Navigation Maps for Online HD Map Construction

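Posted: June 11, 2025 | Author: jarxiv

Summary

Autonomous vehicles rely on detailed and accurate environmental information to operate safely.
High-definition (HD) maps offer a promising solution, but their high maintenance cost is a significant barrier to scalable deployment.
Online HD map construction methods address this challenge by generating local HD maps from live sensor data.
However, these methods are inherently limited by the short perception range of onboard sensors.
To overcome this limitation and improve general performance, recent approaches have explored using standard-definition (SD) maps, which are significantly easier to maintain, as a prior.
We propose SDTagNet, the first online HD map construction method that fully utilizes the information in widely available SD maps such as OpenStreetMap to enhance far-range detection accuracy.
Our approach introduces two key innovations.
First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes but also additional semantic information in the form of textual annotations.
In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies.
Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to integrate all types of map elements uniformly.
Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) relative to map construction without priors and by up to +3.2 mAP (+20%) relative to previous approaches that already use SD map priors.
Code is available at https://github.com/immel-f/SDTagNet

Summary (original)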

Autonomous vehicles rely on detailed and accurate environmental information to operate safely. High definition (HD) maps offer a promising solution, but their high maintenance cost poses a significant barrier to scalable deployment. This challenge is addressed by online HD map construction methods, which generate local HD maps from live sensor data. However, these methods are inherently limited by the short perception range of onboard sensors. To overcome this limitation and improve general performance, recent approaches have explored the use of standard definition (SD) maps as prior, which are significantly easier to maintain. We propose SDTagNet, the first online HD map construction method that fully utilizes the information of widely available SD maps, like OpenStreetMap, to enhance far range detection accuracy. Our approach introduces two key innovations. First, in contrast to previous work, we incorporate not only polyline SD map data with manually selected classes, but additional semantic information in the form of textual annotations. In this way, we enrich SD vector map tokens with NLP-derived features, eliminating the dependency on predefined specifications or exhaustive class taxonomies. Second, we introduce a point-level SD map encoder together with orthogonal element identifiers to uniformly integrate all types of map elements. Experiments on Argoverse 2 and nuScenes show that this boosts map perception performance by up to +5.9 mAP (+45%) w.r.t. map construction without priors and up to +3.2 mAP (+20%) w.r.t. previous approaches that already use SD map priors. Code is available at https://github.com/immel-f/SDTagNet

arXiv information

Authors	Fabian Immel, Jan-Hendrik Pauls, Richard Fehler, Frank Bieder, Jonas Merkert, Christoph Stiller
Published	2025-06-10 17:16:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.RO

Do MIL Models Transfer?

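Posted: June 11, 2025 | Author: jarxiv

Summary

Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images.
However, MIL often struggles with small, weakly supervised clinical datasets.
In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood.
In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction.
Our results show that pretrained MIL models consistently outperform models trained from scratch, even when they were pretrained on organs different from that of the target task.
Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data.
These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath.
Lastly, we provide a resource that standardizes the implementation of MIL models and collects pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab

Summary (original)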

Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on different organs than the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource which standardizes the implementation of MIL models and collection of pretrained model weights on popular CPath tasks, available at https://github.com/mahmoodlab/MIL-Lab
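
One common way for MIL to turn a bag of patch embeddings into a single slide-level embedding is attention-based pooling. The abstract does not say which aggregation the 11 evaluated models use, so the Python below is a generic, simplified sketch; `attention_pool`, the linear scoring vector, and the toy patch values are all hypothetical.

```python
import math


def attention_pool(patches, score_vector):
    """Aggregate patch embeddings into one slide-level embedding.

    Each patch gets a scalar score (here a simple dot product with
    `score_vector`), scores are softmax-normalized into attention weights,
    and the slide embedding is the attention-weighted sum of the patches.
    """
    scores = [sum(w * h for w, h in zip(score_vector, p)) for p in patches]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(patches[0])
    slide = [sum(a * p[d] for a, p in zip(weights, patches))
             for d in range(dim)]
    return slide, weights


# Three toy patch embeddings; the third stands in for a distinctive region.
patches = [[0.1, 0.0], [0.0, 0.2], [2.0, 1.5]]
score_vector = [1.0, 1.0]  # would be learned in a real model
slide_embedding, attention = attention_pool(patches, score_vector)
```

In a transfer-learning setting such as the one studied here, the learned scoring parameters are part of what a pretrained MIL model would carry over to a new task.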

arXiv information

Authors	Daniel Shao, Richard J. Chen, Andrew H. Song, Joel Runevic, Ming Y. Lu, Tong Ding, Faisal Mahmood
Published	2025-06-10 17:50:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

Fine-Grained Spatially Varying Material Selection in Images

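Posted: June 11, 2025 | Author: jarxiv

Summary

Selection is the first step in many image editing processes, enabling faster and simpler modification of all pixels that share a common modality.
In this work, we present a method for material selection in images that is robust to lighting and reflectance variations and can be used for downstream editing tasks.
We rely on vision transformer (ViT) models and leverage their features for selection, proposing a multi-resolution processing strategy that yields finer and more stable selection results than prior methods.
Furthermore, we enable selection at two levels, texture and subtexture, by leveraging a new two-level material selection dataset (DuMaS) that includes dense annotations for over 800,000 synthetic images at both the texture and subtexture levels.

Summary (original)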

Selection is the first step in many image editing processes, enabling faster and simpler modifications of all pixels sharing a common modality. In this work, we present a method for material selection in images, robust to lighting and reflectance variations, which can be used for downstream editing tasks. We rely on vision transformer (ViT) models and leverage their features for selection, proposing a multi-resolution processing strategy that yields finer and more stable selection results than prior methods. Furthermore, we enable selection at two levels: texture and subtexture, leveraging a new two-level material selection (DuMaS) dataset which includes dense annotations for over 800,000 synthetic images, both on the texture and subtexture levels.

arXiv information

Authors	Julia Guerrero-Viu, Michael Fischer, Iliyan Georgiev, Elena Garces, Diego Gutierrez, Belen Masia, Valentin Deschaintre
Published	2025-06-10 17:50:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.GR

DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging

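Posted: June 11, 2025 | Author: jarxiv

Summary

Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, a task known as out-of-distribution (OOD) detection, to prevent unreliable predictions.
Effective OOD detection after deployment could benefit from access to the training data, which would allow test samples to be compared directly against the training data distribution to identify differences.
However, state-of-the-art OOD detection methods either discard the training data after deployment or assume that test samples and training data are stored together centrally, an assumption that rarely holds in real-world settings.
This is because shipping training data with the deployed model is usually impossible, owing to the size of training databases and to proprietary or privacy constraints.
We introduce the Isolation Network, an OOD detection framework that quantifies how difficult it is to separate a target test sample from the training data by solving a binary classification task.
We then propose Decentralized Isolation Networks (DIsoN), which enable training and test data to be compared when data sharing is impossible by exchanging only model parameters between the remote computational nodes of training and deployment.
We further extend DIsoN with class conditioning, comparing a target sample solely with the training data of its predicted class.
We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks.
DIsoN performs favorably against existing methods while respecting data privacy.
This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: remote, secure use of their training data for OOD detection.
Code will be available upon acceptance at: *****

Summary (original)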

Safe deployment of machine learning (ML) models in safety-critical domains such as medical imaging requires detecting inputs with characteristics not seen during training, known as out-of-distribution (OOD) detection, to prevent unreliable predictions. Effective OOD detection after deployment could benefit from access to the training data, enabling direct comparison between test samples and the training data distribution to identify differences. State-of-the-art OOD detection methods, however, either discard training data after deployment or assume that test samples and training data are centrally stored together, an assumption that rarely holds in real-world settings. This is because shipping training data with the deployed model is usually impossible due to the size of training databases, as well as proprietary or privacy constraints. We introduce the Isolation Network, an OOD detection framework that quantifies the difficulty of separating a target test sample from the training data by solving a binary classification task. We then propose Decentralized Isolation Networks (DIsoN), which enables the comparison of training and test data when data-sharing is impossible, by exchanging only model parameters between the remote computational nodes of training and deployment. We further extend DIsoN with class-conditioning, comparing a target sample solely with training data of its predicted class. We evaluate DIsoN on four medical imaging datasets (dermatology, chest X-ray, breast ultrasound, histopathology) across 12 OOD detection tasks. DIsoN performs favorably against existing methods while respecting data-privacy. This decentralized OOD detection framework opens the way for a new type of service that ML developers could provide along with their models: providing remote, secure utilization of their training data for OOD detection services. Code will be available upon acceptance at: *****
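
The idea of scoring a sample by how hard it is to separate from the training data can be illustrated with a toy, centralized example. The Python sketch below trains a small logistic classifier with the target labelled 1 and the training samples labelled 0, and counts gradient steps until the target is isolated. Using the step count as the difficulty measure, the class weighting, and the name `isolation_steps` are assumptions made for illustration; the decentralized parameter exchange of DIsoN is not modeled.

```python
import math


def isolation_steps(train, target, lr=0.1, max_steps=5000):
    """Gradient steps until a logistic classifier isolates `target`.

    Returns `max_steps` if the target is never separated from `train`.
    """
    dim = len(target)
    w, b = [0.0] * dim, 0.0
    # Target is labelled 1, training samples 0. The single target is
    # weighted as much as the whole training set to balance the classes.
    data = [(x, 0.0, 1.0) for x in train]
    data.append((target, 1.0, float(len(train))))
    total_weight = sum(weight for _, _, weight in data)
    for step in range(1, max_steps + 1):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y, weight in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            for d in range(dim):
                grad_w[d] += weight * (p - y) * x[d]
            grad_b += weight * (p - y)
        w = [wi - lr * g / total_weight for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total_weight
        logits = [sum(wi * xi for wi, xi in zip(w, x)) + b
                  for x, _, _ in data]
        # Isolated once the target is positive and all training data negative.
        if logits[-1] > 0 and all(z < 0 for z in logits[:-1]):
            return step
    return max_steps


# Toy one-dimensional "training set" and two hypothetical test samples.
train = [[-1.0], [-0.5], [0.0], [0.5], [1.0]]
ood_steps = isolation_steps(train, [5.0])      # far from the training data
in_dist_steps = isolation_steps(train, [0.2])  # inside the training range
```

A sample far from the training data is isolated after few steps, while a sample lying among the training points cannot be separated by this linear classifier and exhausts the step budget, so a low step count suggests an OOD input.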

arXiv information

Authors	Felix Wagner, Pramit Saha, Harry Anthony, J. Alison Noble, Konstantinos Kamnitsas
Published	2025-06-10 17:52:18+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV, cs.LG, I.2.0

Diffuse and Disperse: Image Generation with Representation Regularization

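Posted: June 11, 2025 | Author: jarxiv

Summary

The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning.
These diffusion models typically rely on regression-based objectives and generally lack explicit regularization.
In this work, we propose Dispersive Loss, a simple plug-and-play regularizer that effectively improves diffusion-based generative models.
Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression.
Compared with the recent representation alignment method (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data.
We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used, strong baselines.
We hope our work will help bridge the gap between generative modeling and representation learning.

Summary (original)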

The development of diffusion-based generative models over the past decade has largely proceeded independently of progress in representation learning. These diffusion models typically rely on regression-based objectives and generally lack explicit regularization. In this work, we propose \textit{Dispersive Loss}, a simple plug-and-play regularizer that effectively improves diffusion-based generative models. Our loss function encourages internal representations to disperse in the hidden space, analogous to contrastive self-supervised learning, with the key distinction that it requires no positive sample pairs and therefore does not interfere with the sampling process used for regression. Compared to the recent method of representation alignment (REPA), our approach is self-contained and minimalist, requiring no pre-training, no additional parameters, and no external data. We evaluate Dispersive Loss on the ImageNet dataset across a range of models and report consistent improvements over widely used and strong baselines. We hope our work will help bridge the gap between generative modeling and representation learning.
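
The abstract does not state the formula for Dispersive Loss. One way to encourage dispersion without positive pairs is an InfoNCE-style repulsion term, the log of the mean of exp(-d/tau) over all pairs of representations in a batch. The Python sketch below assumes that form with squared Euclidean distance; the function name `dispersive_loss`, the temperature, and the toy batches are illustrative assumptions rather than the authors' implementation.

```python
import math


def dispersive_loss(representations, tau=1.0):
    """Repulsion-only regularizer over a batch of hidden representations.

    Computes log(mean over all pairs (i, j) of exp(-||z_i - z_j||^2 / tau)).
    The value is 0 when all representations coincide and decreases as they
    spread apart, so minimizing it pushes representations away from each other.
    """
    n = len(representations)
    total = 0.0
    for a in representations:
        for b in representations:
            sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
            total += math.exp(-sq_dist / tau)
    return math.log(total / (n * n))


# Two toy batches: one tightly clustered, one spread out.
clustered_loss = dispersive_loss([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])
spread_loss = dispersive_loss([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
```

In training, a term like this would be added to the usual regression objective with a small weight; no positive pairs or extra parameters are involved, which matches the plug-and-play description in the abstract.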

arXiv information

Authors	Runqian Wang, Kaiming He
Published	2025-06-10 17:53:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.AI, cs.CV, cs.LG

SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

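Posted: June 11, 2025 | Author: jarxiv

Summary

Despite recent advances in multimodal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models.
Recent studies explore data-driven approaches and improve spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data.
However, these methods typically perform spatial reasoning implicitly and often fail on questions that are trivial for humans, even with long chain-of-thought reasoning.
In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared across multiple stages: 3D perception, computation, and reasoning.
Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves generalization to novel question types.
Furthermore, by analyzing the explicit 3D representations in SpatialReasoner's multi-step reasoning traces, we study its factual errors and identify key shortcomings of current LVLMs.
Results show that SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluated on novel 3D spatial reasoning questions.
Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

Summary (original)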

Despite recent advances on multi-modal models, 3D spatial reasoning remains a challenging task for state-of-the-art open-source and proprietary models. Recent studies explore data-driven approaches and achieve enhanced spatial reasoning performance by fine-tuning models on 3D-related visual question-answering data. However, these methods typically perform spatial reasoning in an implicit manner and often fail on questions that are trivial to humans, even with long chain-of-thought reasoning. In this work, we introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared between multiple stages–3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and improves the generalization ability to novel question types. Furthermore, by analyzing the explicit 3D representations in multi-step reasoning traces of SpatialReasoner, we study the factual errors and identify key shortcomings of current LVLMs. Results show that our SpatialReasoner achieves improved performance on a variety of spatial reasoning benchmarks, outperforming Gemini 2.0 by 9.2% on 3DSRBench, and generalizes better when evaluating on novel 3D spatial reasoning questions. Our study bridges the 3D parsing capabilities of prior visual foundation models with the powerful reasoning abilities of large language models, opening new directions for 3D spatial reasoning.

arXiv information

Authors	Wufei Ma, Yu-Cheng Chou, Qihao Liu, Xingrui Wang, Celso de Melo, Jianwen Xie, Alan Yuille
Published	2025-06-10 17:53:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV

SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models

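Posted: June 11, 2025 | Author: jarxiv

Summary

Humans naturally understand 3D spatial relationships, which enables complex reasoning such as predicting collisions between vehicles approaching from different directions.
Current large multimodal models (LMMs), however, lack this capability for 3D spatial reasoning.
This limitation stems from the scarcity of 3D training data and the bias of current model designs toward 2D data.
In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, and introduce SpatialLLM, a large multimodal model with advanced 3D spatial reasoning abilities.
To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on objects' 3D locations and orientations, and (2) 3D-informed conversation data for complex spatial relationships.
Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images.
Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities.
SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%.
Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction.
The project page is available at https://3d-spatial-reasoning.github.io/spatial-llm/

Summary (original)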

Humans naturally understand 3D spatial relationships, enabling complex reasoning like predicting collisions of vehicles from different directions. Current large multimodal models (LMMs), however, lack of this capability of 3D spatial reasoning. This limitation stems from the scarcity of 3D training data and the bias in current model designs toward 2D data. In this paper, we systematically study the impact of 3D-informed data, architecture, and training setups, introducing SpatialLLM, a large multi-modal model with advanced 3D spatial reasoning abilities. To address data limitations, we develop two types of 3D-informed training datasets: (1) 3D-informed probing data focused on object’s 3D location and orientation, and (2) 3D-informed conversation data for complex spatial relationships. Notably, we are the first to curate VQA data that incorporate 3D orientation relationships on real images. Furthermore, we systematically integrate these two types of training data with the architectural and training designs of LMMs, providing a roadmap for optimal design aimed at achieving superior 3D reasoning capabilities. Our SpatialLLM advances machines toward highly capable 3D-informed reasoning, surpassing GPT-4o performance by 8.7%. Our systematic empirical design and the resulting findings offer valuable insights for future research in this direction. Our project page is available at: https://3d-spatial-reasoning.github.io/spatial-llm/

arXiv information

Authors	Wufei Ma, Luoxin Ye, Celso M de Melo, Jieneng Chen, Alan Yuille
Published	2025-06-10 17:54:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

Categories: cs.CV
