Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning

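Summary

This work aims to give robots the ability to plan tasks in a physically grounded way.
Recent work has shown that large language models (LLMs) hold broad knowledge that is useful for robotic tasks, particularly reasoning and planning.
However, LLMs lack grounding in the world and rely on external affordance models to perceive the environment, and those models cannot reason jointly with the LLM.
We argue that a task planner should be an inherently grounded, unified multimodal system.
To that end, we introduce Robotic Vision-Language Planning (ViLa), a new approach to long-horizon robotic planning that uses vision-language models (VLMs) to generate a sequence of actionable steps.
ViLa integrates perceptual data directly into its reasoning and planning process, which lets it draw on a deep understanding of commonsense knowledge about the visual world, such as spatial layouts and object attributes.
It also supports flexible multimodal goal specification and incorporates visual feedback naturally.
Extensive evaluation in both real-robot and simulated environments demonstrates that ViLa outperforms existing LLM-based planners and highlights its effectiveness across a wide range of open-world manipulation tasks.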

Summary (original)

In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and dependence on external affordance models to perceive environmental information, which cannot jointly reason with LLMs. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa’s superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.
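The abstract describes a planner that looks at the scene, proposes an actionable step, and takes visual feedback into account. One way to read that is as a closed loop that re-observes after every executed step. The sketch below illustrates only that loop shape and is not the authors' implementation: `ClosedLoopPlanner`, `toy_vlm`, and `demo` are hypothetical names, the "image" is a plain string, and the VLM is a hand-written stub rather than a call to GPT-4V.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: an observation stands in for a camera image (here just a string).
Observation = str
# Assumption: a VLM is any callable that maps (image, goal, history) to the
# next step in natural language, or None when it judges the goal achieved.
VLM = Callable[[Observation, str, List[str]], Optional[str]]


@dataclass
class ClosedLoopPlanner:
    """Hypothetical closed-loop planner: observe, ask for one step, execute, repeat."""
    vlm: VLM
    max_steps: int = 10
    history: List[str] = field(default_factory=list)

    def run(self, goal: str,
            observe: Callable[[], Observation],
            execute: Callable[[str], None]) -> List[str]:
        for _ in range(self.max_steps):
            image = observe()                        # fresh visual feedback each iteration
            step = self.vlm(image, goal, self.history)
            if step is None:                         # stub signals that the goal is reached
                break
            execute(step)
            self.history.append(step)
        return self.history


def toy_vlm(image: Observation, goal: str, history: List[str]) -> Optional[str]:
    """Stub standing in for a real VLM: clear the blocker before grasping the target."""
    if "blocker" in image:
        return "move blocker aside"
    if "target" in image:
        return "pick up target"
    return None


def demo() -> List[str]:
    scene = {"blocker", "target"}                    # toy world state

    def observe() -> Observation:
        return ",".join(sorted(scene))

    def execute(step: str) -> None:
        # Each executed step removes the object it acted on from the toy scene.
        scene.discard("blocker" if step.startswith("move") else "target")

    return ClosedLoopPlanner(vlm=toy_vlm).run("pick up the target", observe, execute)


if __name__ == "__main__":
    print(demo())
```

In this toy run the stub first asks to move the blocking object and only then to pick up the target, because each decision is made from the latest observation rather than from a plan fixed in advance.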

arXiv information

Authors	Yingdong Hu, Fanqi Lin, Tong Zhang, Li Yi, Yang Gao
Published	2023-11-29 17:46:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
