From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D

Abstract

Recent advances in LVLMs have improved vision-language understanding, but they still struggle with spatial perception, limiting their ability to reason about complex 3D scenes. Unlike previous approaches that incorporate 3D representations into models to improve spatial understanding, we aim to unlock the potential of VLMs by leveraging spatially relevant image data. To this end, we introduce a novel 2D spatial data generation and annotation pipeline built upon scene data with 3D ground-truth. This pipeline enables the creation of a diverse set of spatial tasks, ranging from basic perception tasks to more complex reasoning tasks. Leveraging this pipeline, we construct SPAR-7M, a large-scale dataset generated from thousands of scenes across multiple public datasets. In addition, we introduce SPAR-Bench, a benchmark designed to offer a more comprehensive evaluation of spatial capabilities compared to existing spatial benchmarks, supporting both single-view and multi-view inputs. Training on both SPAR-7M and large-scale 2D datasets enables our models to achieve state-of-the-art performance on 2D spatial benchmarks. Further fine-tuning on 3D task-specific datasets yields competitive results, underscoring the effectiveness of our dataset in enhancing spatial reasoning.

arXiv information

Authors	Jiahui Zhang, Yurui Chen, Yanpeng Zhou, Yueming Xu, Ze Huang, Jilin Mei, Junhui Chen, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang
Published	2025-05-09 09:48:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
