Solving Robotics Problems in Zero-Shot with Vision-Language Models

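Summary

We introduce Wonderful Team, a multi-agent visual LLM (VLLM) framework for solving robotics problems in the zero-shot regime.
Zero-shot here means that, for a new environment, the VLLM is given an image of the robot's environment and a description of the task, and outputs the sequence of actions the robot needs to complete the task.
Prior work on VLLMs in robotics has focused mainly on settings where part of the pipeline is fine-tuned, for example by tuning an LLM on robot data or training a separate vision encoder for perception and action generation.
Surprisingly, recent advances in VLLM capabilities mean this kind of fine-tuning may no longer be needed for many tasks.
This work shows that, with careful engineering, a single off-the-shelf VLLM can be prompted to handle every aspect of a robotics task, from high-level planning to low-level location extraction and action execution.
Wonderful Team builds on recent advances in multi-agent LLMs: by partitioning tasks across an agent hierarchy, it is self-corrective and can break down and solve even long-horizon tasks.
Extensive experiments on VIMABench and in real-world robot environments demonstrate that the system handles a variety of robotic tasks, including manipulation, visual goal-reaching, and visual reasoning, all zero-shot.
These results underscore a key point: vision-language models have progressed rapidly over the past year and should be strongly considered as a backbone for robotics problems going forward.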

Abstract (original)

We introduce Wonderful Team, a multi-agent visual LLM (VLLM) framework for solving robotics problems in the zero-shot regime. By zero-shot we mean that, for a novel environment, we feed a VLLM an image of the robot’s environment and a description of the task, and have the VLLM output the sequence of actions necessary for the robot to complete the task. Prior work on VLLMs in robotics has largely focused on settings where some part of the pipeline is fine-tuned, such as tuning an LLM on robot data or training a separate vision encoder for perception and action generation. Surprisingly, due to recent advances in the capabilities of VLLMs, this type of fine-tuning may no longer be necessary for many tasks. In this work, we show that with careful engineering, we can prompt a single off-the-shelf VLLM to handle all aspects of a robotics task, from high-level planning to low-level location-extraction and action-execution. Wonderful Team builds on recent advances in multi-agent LLMs to partition tasks across an agent hierarchy, making it self-corrective and able to effectively partition and solve even long-horizon tasks. Extensive experiments on VIMABench and real-world robotic environments demonstrate the system’s capability to handle a variety of robotic tasks, including manipulation, visual goal-reaching, and visual reasoning, all in a zero-shot manner. These results underscore a key point: vision-language models have progressed rapidly in the past year, and should strongly be considered as a backbone for robotics problems going forward.
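The zero-shot loop described in the abstract (image and task description in, action sequence out, with self-correction) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Action` fields, the JSON reply format, the prompt wording, the rule-based `check_plan` verifier, and the stub `fake_vllm` are all assumptions made for illustration. In the paper, verification is carried out by agents in the hierarchy rather than by a hand-written rule.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """Hypothetical action format: a named primitive at normalized image coordinates."""
    name: str
    x: float
    y: float


# Assumption: a VLLM is modeled as any callable mapping (prompt, image bytes) to text.
VLLM = Callable[[str, bytes], str]


def plan_actions(vllm: VLLM, image: bytes, task: str,
                 feedback: Optional[str] = None) -> List[Action]:
    """Ask the model for a plan; include the rejection reason when re-planning."""
    prompt = f"Task: {task}\nReturn a JSON list of actions with keys name, x, y."
    if feedback is not None:
        prompt += f"\nA previous plan was rejected: {feedback}"
    return [Action(**item) for item in json.loads(vllm(prompt, image))]


def check_plan(actions: List[Action]) -> Optional[str]:
    """Stand-in verifier: return None if the plan looks valid, else a reason."""
    if not actions:
        return "plan is empty"
    for i, a in enumerate(actions):
        if not (0.0 <= a.x <= 1.0 and 0.0 <= a.y <= 1.0):
            return f"action {i} ({a.name}) has coordinates outside the image"
    return None


def solve_zero_shot(vllm: VLLM, image: bytes, task: str,
                    max_rounds: int = 3) -> List[Action]:
    """Plan, verify, and re-plan with feedback until a plan passes or rounds run out."""
    feedback = None
    for _ in range(max_rounds):
        actions = plan_actions(vllm, image, task, feedback)
        feedback = check_plan(actions)
        if feedback is None:
            return actions
    raise RuntimeError(f"no valid plan after {max_rounds} rounds: {feedback}")


def fake_vllm(prompt: str, image: bytes) -> str:
    """Stub model for the demo: gives a bad plan first, a valid one after feedback."""
    if "rejected" in prompt:
        return ('[{"name": "pick", "x": 0.2, "y": 0.4},'
                ' {"name": "place", "x": 0.7, "y": 0.5}]')
    return '[{"name": "pick", "x": 1.8, "y": 0.4}]'


if __name__ == "__main__":
    for action in solve_zero_shot(fake_vllm, b"", "put the block in the bowl"):
        print(action)
```

Treating the model as a plain callable keeps the sketch independent of any particular VLLM API. A real system would replace `fake_vllm` with a call to an off-the-shelf model and `check_plan` with a model-based verification step.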

arXiv information

Authors	Zidan Wang, Rui Shen, Bradly Stadie
Published	2024-08-23 16:06:27+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
