Language-Image Models with 3D Understanding

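Abstract

Multimodal large language models (MLLMs) have shown remarkable capabilities across a variety of 2D vision and language tasks. We extend the perceptual capabilities of MLLMs to ground and reason about images in 3D space. To that end, we first develop LV3D, a large-scale pre-training dataset for 2D and 3D. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted, for example with a 2D box or a set of candidate 3D boxes from specialists. Experiments on outdoor benchmarks show that Cube-LLM significantly outperforms existing baselines, by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and by 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios. Cube-LLM also shows competitive results on general MLLM benchmarks such as refCOCO for 2D grounding, with an average score of 87.0, as well as on visual question answering benchmarks such as VQAv2, GQA, SQA, and POPE for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.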

Abstract (original)

Multi-modal large language models (MLLMs) have shown incredible capabilities in a variety of 2D vision and language tasks. We extend MLLMs’ perceptual capabilities to ground and reason about images in 3-dimensional space. To that end, we first develop a large-scale pre-training dataset for 2D and 3D called LV3D by combining multiple existing 2D and 3D recognition datasets under a common task formulation: as multi-turn question-answering. Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D. We show that pure data scaling makes a strong 3D perception capability without 3D specific architectural design or training objective. Cube-LLM exhibits intriguing properties similar to LLMs: (1) Cube-LLM can apply chain-of-thought prompting to improve 3D understanding from 2D context information. (2) Cube-LLM can follow complex and diverse instructions and adapt to versatile input and output formats. (3) Cube-LLM can be visually prompted such as 2D box or a set of candidate 3D boxes from specialists. Our experiments on outdoor benchmarks demonstrate that Cube-LLM significantly outperforms existing baselines by 21.3 points of AP-BEV on the Talk2Car dataset for 3D grounded reasoning and 17.7 points on the DriveLM dataset for complex reasoning about driving scenarios, respectively. Cube-LLM also shows competitive results in general MLLM benchmarks such as refCOCO for 2D grounding with (87.0) average score, as well as visual question answering benchmarks such as VQAv2, GQA, SQA, POPE, etc. for complex reasoning. Our project is available at https://janghyuncho.github.io/Cube-LLM.

arXiv information

Authors	Jang Hyun Cho,Boris Ivanovic,Yulong Cao,Edward Schmerling,Yue Wang,Xinshuo Weng,Boyi Li,Yurong You,Philipp Krähenbühl,Yan Wang,Marco Pavone
Published	2024-05-06 17:57:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
