ManipVQA: Injecting Robotic Affordance and Physically Grounded Information into Multi-Modal Large Language Models

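Abstract

Integrating Multimodal Large Language Models (MLLMs) with robotic systems has significantly improved the ability of robots to interpret natural language instructions and act on them.
Despite these advances, conventional MLLMs are typically trained on generic image-text pairs and lack essential robotics knowledge such as affordances and physical knowledge, which limits their effectiveness in manipulation tasks.
To bridge this gap, we introduce ManipVQA, a novel framework designed to give MLLMs manipulation-centric knowledge through a visual question-answering format.
The approach covers tool detection and affordance recognition, and also extends to a comprehensive understanding of physical concepts.
We begin by collecting a varied set of images showing interactive objects, which poses a broad range of challenges in tool object detection, affordance prediction, and physical concept prediction.
To integrate this robot-specific knowledge seamlessly with the inherent vision-reasoning capabilities of MLLMs, we adopt a unified VQA format and devise a fine-tuning strategy that preserves the original vision-reasoning abilities while incorporating the new robotic insights.
Empirical evaluations in robotic simulators and across various vision task benchmarks demonstrate the robust performance of ManipVQA.
Code and dataset will be made publicly available at https://github.com/SiyuanHuang95/ManipVQA.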

Abstract (original)

The integration of Multimodal Large Language Models (MLLMs) with robotic systems has significantly enhanced the ability of robots to interpret and act upon natural language instructions. Despite these advancements, conventional MLLMs are typically trained on generic image-text pairs, lacking essential robotics knowledge such as affordances and physical knowledge, which hampers their efficacy in manipulation tasks. To bridge this gap, we introduce ManipVQA, a novel framework designed to endow MLLMs with Manipulation-centric knowledge through a Visual Question-Answering format. This approach not only encompasses tool detection and affordance recognition but also extends to a comprehensive understanding of physical concepts. Our approach starts with collecting a varied set of images displaying interactive objects, which presents a broad range of challenges in tool object detection, affordance, and physical concept predictions. To seamlessly integrate this robotic-specific knowledge with the inherent vision-reasoning capabilities of MLLMs, we adopt a unified VQA format and devise a fine-tuning strategy that preserves the original vision-reasoning abilities while incorporating the new robotic insights. Empirical evaluations conducted in robotic simulators and across various vision task benchmarks demonstrate the robust performance of ManipVQA. Code and dataset will be made publicly available at https://github.com/SiyuanHuang95/ManipVQA.

arXiv information

Authors	Siyuan Huang, Iaroslav Ponomarenko, Zhengkai Jiang, Xiaoqi Li, Xiaobin Hu, Peng Gao, Hongsheng Li, Hao Dong
Published	2024-03-17 17:59:59+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
