RT-Grasp: Reasoning Tuning Robotic Grasping via Multi-modal Large Language Model

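Summary

Recent advances in large language models (LLMs) have demonstrated remarkable reasoning capabilities and made them influential across many fields.
In robotics, however, their inherently textual output has largely confined them to manipulation planning tasks.
This paper addresses that limitation by investigating whether the reasoning ability of LLMs can be used to generate numerical predictions in robotics tasks, specifically robotic grasping.
We propose Reasoning Tuning, a new method that integrates a reasoning phase before prediction during training, drawing on the extensive prior knowledge and advanced reasoning abilities of LLMs.
With this approach, LLMs, particularly those with multi-modal capabilities, can generate accurate numerical outputs such as grasp poses that are context-aware and can be adapted through conversation.
We also present the Reasoning Tuning VLM Grasp dataset, carefully curated to facilitate the adaptation of LLMs to robotic grasping.
Extensive validation on both grasping datasets and real-world experiments underscores the adaptability of multi-modal LLMs to numerical prediction tasks in robotics.
This not only broadens their applicability but also bridges the gap between text-based planning and direct robot control, maximizing the potential of LLMs in robotics.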

Abstract (original)

Recent advances in Large Language Models (LLMs) have showcased their remarkable reasoning capabilities, making them influential across various fields. However, in robotics, their use has primarily been limited to manipulation planning tasks due to their inherent textual output. This paper addresses this limitation by investigating the potential of adopting the reasoning ability of LLMs for generating numerical predictions in robotics tasks, specifically for robotic grasping. We propose Reasoning Tuning, a novel method that integrates a reasoning phase before prediction during training, leveraging the extensive prior knowledge and advanced reasoning abilities of LLMs. This approach enables LLMs, notably with multi-modal capabilities, to generate accurate numerical outputs like grasp poses that are context-aware and adaptable through conversations. Additionally, we present the Reasoning Tuning VLM Grasp dataset, carefully curated to facilitate the adaptation of LLMs to robotic grasping. Extensive validation on both grasping datasets and real-world experiments underscores the adaptability of multi-modal LLMs for numerical prediction tasks in robotics. This not only expands their applicability but also bridges the gap between text-based planning and direct robot control, thereby maximizing the potential of LLMs in robotics.

arXiv information

Authors	Jinxuan Xu, Shiyu Jin, Yutian Lei, Yuqian Zhang, Liangjun Zhang
Published	2024-11-07 22:17:53+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
