SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Abstract

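Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
Vision Language Models (VLMs) have shown remarkable performance on certain VQA benchmarks, but they still lack 3D spatial reasoning capabilities, such as recognizing quantitative relationships between physical objects like distances or size differences.
We hypothesize that the limited spatial reasoning ability of VLMs stems from a lack of 3D spatial knowledge in their training data, and we aim to solve this problem by training VLMs on Internet-scale spatial reasoning data.
To this end, we present a system that facilitates this approach.
First, we develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
Next, we investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture.
Our work features the first Internet-scale 3D spatial reasoning dataset in metric space.
Training a VLM on such data significantly improves its ability on both qualitative and quantitative spatial VQA.
Finally, we demonstrate that, thanks to its quantitative estimation capability, this VLM enables novel downstream applications in chain-of-thought spatial reasoning and robotics.
Project website: https://spatial-vlm.github.io/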

Abstract (original)

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLM) have demonstrated remarkable performance in certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships of physical objects like distances or size differences. We hypothesize that VLMs’ limited spatial reasoning capability is due to the lack of 3D spatial knowledge in training data and aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/
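As a rough illustration of what a quantitative spatial VQA example in metric space could look like, the sketch below turns two object centroids, given in meters, into a templated question-answer pair. The `Object3D` record, the template wording, and the sample coordinates are all assumptions made for illustration. The abstract does not describe the actual data generation pipeline, and a real system would first need to obtain such 3D positions from 2D images.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Object3D:
    """Hypothetical record: an object name plus its 3D centroid in meters."""
    name: str
    center: Tuple[float, float, float]


def distance_m(a: Object3D, b: Object3D) -> float:
    """Euclidean distance between two centroids, in meters."""
    return math.dist(a.center, b.center)


def make_qa(a: Object3D, b: Object3D) -> Tuple[str, str]:
    """Fill a fixed (assumed) question/answer template with the measured distance."""
    question = f"How far is the {a.name} from the {b.name}?"
    answer = f"The {a.name} is about {distance_m(a, b):.2f} meters from the {b.name}."
    return question, answer


# Made-up coordinates, chosen only to show the shape of one example.
cup = Object3D("cup", (0.0, 0.0, 1.0))
laptop = Object3D("laptop", (0.3, 0.0, 1.4))
question, answer = make_qa(cup, laptop)
print(question)
print(answer)
```

Pairing each templated question with an answer computed from metric coordinates is one simple way to get quantitative supervision; qualitative questions (for example, which object is closer to the camera) could be templated from the same centroids.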

arXiv information

Authors	Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia
Published	2024-01-22 18:01:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
