Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

Abstract

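Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora.
They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data.
We explore the reverse paradigm: using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs.
In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs.
Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing.
Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases.
At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, the task goal, and the target object.
These properties are used to generate representative VQA queries (images with textual multiple-choice questions) based on spatial, goal-conditioned, and interaction reasoning question templates.
We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories.
Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.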
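
To make the idea of segmenting a trajectory into manipulation phases from non-visual signals more concrete, here is a minimal sketch that uses only a gripper-aperture time series. It is not the authors' implementation: the function name `segment_by_gripper`, the phase labels, and the `closed_below` threshold are all assumptions chosen for illustration, and the abstract also mentions end-effector pose and force sensing, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    label: str  # hypothetical phase name, not taken from the paper
    start: int  # first time step of the phase (inclusive)
    end: int    # one past the last time step (exclusive)


def segment_by_gripper(aperture: List[float], closed_below: float = 0.5) -> List[Phase]:
    """Split a trajectory into contiguous phases using gripper aperture only.

    Assumption: an aperture below `closed_below` means the gripper is closed.
    """
    phases: List[Phase] = []
    has_grasped = False
    for i, value in enumerate(aperture):
        if value < closed_below:
            label = "holding"
            has_grasped = True
        else:
            # An open gripper means "approach" before the first grasp
            # and "post_release" after it.
            label = "post_release" if has_grasped else "approach"
        if phases and phases[-1].label == label:
            phases[-1].end = i + 1  # extend the current phase
        else:
            phases.append(Phase(label, i, i + 1))
    return phases


if __name__ == "__main__":
    for phase in segment_by_gripper([1.0, 0.9, 0.2, 0.1, 0.1, 0.8, 1.0]):
        print(phase)
```

A real pipeline would likely combine several signals and smooth noisy readings before thresholding; the single fixed threshold here is only the simplest possible choice.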

Abstract (original)

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm – using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of the robot, task goal, and the target object. The properties are used to generate representative VQA queries – images with textual multiple-choice questions – based on spatial, goal-conditioned, and interaction reasoning question templates. We curate Robo2VLM-1, a large-scale in-the-wild dataset with 684,710 questions covering 463 distinct scenes and 3,396 robotic manipulation tasks from 176k real robot trajectories. Results suggest that Robo2VLM-1 can benchmark and improve VLM capabilities in spatial and interaction reasoning.
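
As a rough illustration of how a ground-truth property might be turned into a multiple-choice question from a template, the sketch below shuffles one correct answer among distractors and records which letter is correct. The function `make_multiple_choice`, the dictionary layout, and the example question text are hypothetical and are not taken from Robo2VLM.

```python
import random
from typing import Dict, List


def make_multiple_choice(
    question: str,
    correct: str,
    distractors: List[str],
    seed: int = 0,
) -> Dict[str, object]:
    """Build one multiple-choice item with a shuffled answer order.

    Assumptions: `distractors` are all different from `correct`, and there
    are at most eight options in total.
    """
    rng = random.Random(seed)  # seeded so the item is reproducible
    options = [correct] + list(distractors)
    rng.shuffle(options)
    letters = "ABCDEFGH"[: len(options)]
    choices = dict(zip(letters, options))
    answer = letters[options.index(correct)]
    return {"question": question, "choices": choices, "answer": answer}


if __name__ == "__main__":
    # The correct answer would come from a sensor reading such as gripper
    # aperture; here it is hard-coded for illustration.
    item = make_multiple_choice(
        "Is the robot gripper open or closed in this image?",
        correct="closed",
        distractors=["open", "partially open"],
    )
    print(item)
```

Shuffling the options avoids the correct answer always landing on the same letter, which a model could otherwise exploit without looking at the image.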

arXiv information

Authors	Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Ken Goldberg
Published	2025-05-21 13:42:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
