RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics

Summary

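Spatial understanding is a key capability that lets robots perceive their surroundings, reason about their environment, and interact with it meaningfully.
In modern robotics, these capabilities are increasingly provided by vision-language models.
However, these models face significant challenges on spatial reasoning tasks, because their training data come from general-purpose image datasets that often lack sophisticated spatial understanding.
For example, such datasets frequently fail to capture reference-frame comprehension, yet effective spatial reasoning requires knowing whether to reason from an ego-centric, world-centric, or object-centric perspective.
To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics.
It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images and annotated with rich spatial information relevant to robotics.
The dataset contains 1M images, 5k 3D scans, and 3M annotated spatial relationships, and pairing the 2D egocentric images with the 3D scans makes it ready for both 2D and 3D use.
Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.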
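
The point about reference frames can be made concrete with a small sketch. It is an illustration only and is not taken from the RoboSpatial dataset or its code: the function `side_of`, the top-down 2D layout, and the chair and cup coordinates are all hypothetical. It shows that the same two objects can yield different answers to "is the cup left of the chair?" depending on whose forward direction defines "left".

```python
import math

def side_of(reference, target, forward):
    """Classify `target` as 'left' or 'right' of `reference`.

    Hypothetical helper for illustration. Points are (x, y) in a top-down
    view with z pointing up; `forward` is the forward direction of whichever
    frame (viewer, object, or world) is being used.
    """
    fx, fy = forward
    left = (-fy, fx)  # forward rotated 90 degrees counter-clockwise
    dx, dy = target[0] - reference[0], target[1] - reference[1]
    lateral = dx * left[0] + dy * left[1]  # signed offset along the left axis
    if math.isclose(lateral, 0.0, abs_tol=1e-9):
        return "aligned"
    return "left" if lateral > 0 else "right"

# Assumed scene: a camera south of a chair looks north (+y);
# the chair faces the camera (south, -y); a cup sits east of the chair.
chair, cup = (0.0, 0.0), (1.0, 0.0)
camera_forward = (0.0, 1.0)   # ego-centric frame
chair_forward = (0.0, -1.0)   # object-centric frame
world_forward = (0.0, 1.0)    # world frame, assumed to use +y as forward

ego_answer = side_of(chair, cup, camera_forward)
object_answer = side_of(chair, cup, chair_forward)
world_answer = side_of(chair, cup, world_forward)
print(ego_answer, object_answer, world_answer)
```

In this assumed scene the ego-centric and world answers agree while the object-centric answer is the opposite, which is the kind of ambiguity the summary says general-purpose image datasets tend not to capture.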

Summary (original)

Spatial understanding is a crucial capability that enables robots to perceive their surroundings, reason about their environment, and interact with it meaningfully. In modern robotics, these capabilities are increasingly provided by vision-language models. However, these models face significant challenges in spatial reasoning tasks, as their training data are based on general-purpose image datasets that often lack sophisticated spatial understanding. For example, datasets frequently do not capture reference frame comprehension, yet effective spatial reasoning requires understanding whether to reason from ego-, world-, or object-centric perspectives. To address this issue, we introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. The dataset includes 1M images, 5k 3D scans, and 3M annotated spatial relationships, and the pairing of 2D egocentric images with 3D scans makes it both 2D- and 3D- ready. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.

arXiv information

Authors	Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, Stan Birchfield
Published	2025-03-26 07:30:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
