Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation

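Summary

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments, and tasks specified by diverse language instructions.
To improve generalization, recent work has incorporated large language models (LLMs) for planning and action execution.
While promising, these methods often fall short of generating plans that are grounded in the visual environment.
Efforts have been made to apply visual instruction tuning to LLMs for robotic manipulation, but existing methods are typically limited to single-view image input and struggle with precise object grounding.
This work introduces Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation.
Gondola takes multi-view images and the history of plans as input and produces the next action plan as interleaved text and segmentation masks of target objects and locations.
To support training Gondola, three types of datasets are constructed with the RLBench simulator: robot grounded planning, multi-view referring expression, and pseudo long-horizon task datasets.
Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset: novel placements, rigid objects, articulated objects, and long-horizon tasks.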

Abstract (original)

Robotic manipulation faces a significant challenge in generalizing across unseen objects, environments and tasks specified by diverse language instructions. To improve generalization capabilities, recent research has incorporated large language models (LLMs) for planning and action execution. While promising, these methods often fall short in generating grounded plans in visual environments. Although efforts have been made to perform visual instructional tuning on LLMs for robotic manipulation, existing methods are typically constrained by single-view image input and struggle with precise object grounding. In this work, we introduce Gondola, a novel grounded vision-language planning model based on LLMs for generalizable robotic manipulation. Gondola takes multi-view images and history plans to produce the next action plan with interleaved texts and segmentation masks of target objects and locations. To support the training of Gondola, we construct three types of datasets using the RLBench simulator, namely robot grounded planning, multi-view referring expression and pseudo long-horizon task datasets. Gondola outperforms the state-of-the-art LLM-based method across all four generalization levels of the GemBench dataset, including novel placements, rigid objects, articulated objects and long-horizon tasks.

arXiv information

Authors	Shizhe Chen, Ricardo Garcia, Paul Pacaud, Cordelia Schmid
Published	2025-06-12 20:04:31+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
