Inst3D-LMM: Instance-Aware 3D Scene Understanding with Multi-modal Instruction Tuning

Summary

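Despite encouraging progress in 3D scene understanding, developing an effective large multi-modal model (LMM) that can understand and reason in complex 3D environments remains challenging.
Most previous methods encode 3D point and 2D image features separately, neglecting the interactions between 2D semantics and 3D object properties as well as the spatial relationships within the 3D environment.
This limitation not only hinders a comprehensive representation of the 3D scene but also compromises training and inference efficiency.
To address these challenges, we propose a unified instance-aware 3D large multi-modal model (Inst3D-LMM) that handles multiple 3D scene understanding tasks simultaneously.
To obtain fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module that injects multi-view 2D semantics into the corresponding 3D geometric features.
For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module that captures the intricate pairwise spatial relationships among objects.
In addition, we perform end-to-end multi-task instruction tuning simultaneously, without subsequent task-specific fine-tuning.
Extensive experiments show that our approach outperforms state-of-the-art methods across 3D scene understanding, reasoning, and grounding tasks.
Source code is available at https://github.com/hanxunyu/Inst3D-LMM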
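
The abstract names two ideas without giving formulas: fusing multi-view 2D semantics into a 3D instance feature, and capturing pairwise spatial relationships among objects. The sketch below is only an illustration of those two ideas, not the paper's MCMF or 3D-ISR implementation. It assumes single-query scaled dot-product attention for the fusion step and assumes that a relation can be summarized by the distance and unit direction between instance centers. The function names `cross_attend` and `pairwise_relations` are hypothetical.

```python
import math


def cross_attend(query, views):
    """Fuse one 3D instance feature (query) with its multi-view 2D features.

    Assumption: single-query scaled dot-product attention with a residual
    connection. The 3D feature scores each 2D view feature, and the
    softmax-weighted mix of the views is added back to the query.
    """
    d = len(query)
    scores = [
        sum(q * v for q, v in zip(query, view)) / math.sqrt(d) for view in views
    ]
    # Subtract the maximum score for a numerically stable softmax.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    mixed = [sum(w * view[i] for w, view in zip(weights, views)) for i in range(d)]
    return [q + x for q, x in zip(query, mixed)]


def pairwise_relations(centers):
    """Return {(i, j): (distance, unit_direction)} for all ordered pairs i != j.

    Assumption: each instance is reduced to its 3D center, and the relation
    from i to j is the Euclidean distance plus the unit vector pointing to j.
    """
    relations = {}
    for i, a in enumerate(centers):
        for j, b in enumerate(centers):
            if i == j:
                continue
            offset = [bj - ai for ai, bj in zip(a, b)]
            dist = math.sqrt(sum(o * o for o in offset))
            # Coincident centers have no defined direction; use a zero vector.
            direction = [o / dist for o in offset] if dist > 0 else [0.0] * len(offset)
            relations[(i, j)] = (dist, direction)
    return relations


# Toy usage with made-up numbers: two instance centers and two 2D view features.
centers = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
distance, direction = pairwise_relations(centers)[(0, 1)]
fused = cross_attend([1.0, 0.0], [[0.0, 2.0], [0.0, 2.0]])
```

With N instances this produces N(N-1) ordered relations, so a real system would likely restrict or compress the pairs; the abstract does not say how the authors handle this.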

Abstract (original)

Despite encouraging progress in 3D scene understanding, it remains challenging to develop an effective Large Multi-modal Model (LMM) that is capable of understanding and reasoning in complex 3D environments. Most previous methods typically encode 3D point and 2D image features separately, neglecting interactions between 2D semantics and 3D object properties, as well as the spatial relationships within the 3D environment. This limitation not only hinders comprehensive representations of 3D scene, but also compromises training and inference efficiency. To address these challenges, we propose a unified Instance-aware 3D Large Multi-modal Model (Inst3D-LMM) to deal with multiple 3D scene understanding tasks simultaneously. To obtain the fine-grained instance-level visual tokens, we first introduce a novel Multi-view Cross-Modal Fusion (MCMF) module to inject the multi-view 2D semantics into their corresponding 3D geometric features. For scene-level relation-aware tokens, we further present a 3D Instance Spatial Relation (3D-ISR) module to capture the intricate pairwise spatial relationships among objects. Additionally, we perform end-to-end multi-task instruction tuning simultaneously without the subsequent task-specific fine-tuning. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods across 3D scene understanding, reasoning and grounding tasks. Source code is available at https://github.com/hanxunyu/Inst3D-LMM

arXiv information

Authors	Hanxun Yu, Wentong Li, Song Wang, Junbo Chen, Jianke Zhu
Published	2025-06-16 13:53:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
