Guiding Long-Horizon Task and Motion Planning with Vision Language Models

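Abstract

Vision-language models (VLMs) can generate plausible high-level plans when given a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, plans often omit prerequisite steps, such as opening a drawer to access an object. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but they do not scale to everyday problems that require common-sense knowledge and involve large state spaces made up of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that uses a VLM to generate semantically meaningful, horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again to replan. We evaluate VLM-TAMP on kitchen tasks in which a robot must achieve cooking goals that require executing 30 to 50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, in both success rate (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See the project site https://zt-yang.github.io/vlm-tamp-robot/ for details.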

Abstract (original)

Vision-Language Models (VLMs) can generate plausible high-level plans when prompted with a goal, the context, an image of the scene, and any planning constraints. However, there is no guarantee that the predicted actions are geometrically and kinematically feasible for a particular robot embodiment. As a result, many prerequisite steps, such as opening drawers to access objects, are often omitted in their plans. Robot task and motion planners can generate motion trajectories that respect the geometric feasibility of actions and insert physically necessary actions, but do not scale to everyday problems that require common-sense knowledge and involve large state spaces composed of many variables. We propose VLM-TAMP, a hierarchical planning algorithm that leverages a VLM to generate both semantically meaningful and horizon-reducing intermediate subgoals that guide a task and motion planner. When a subgoal or action cannot be refined, the VLM is queried again for replanning. We evaluate VLM-TAMP on kitchen tasks where a robot must accomplish cooking goals that require performing 30-50 actions in sequence and interacting with up to 21 objects. VLM-TAMP substantially outperforms baselines that rigidly and independently execute VLM-generated action sequences, both in terms of success rates (50 to 100% versus 0%) and average task completion percentage (72 to 100% versus 15 to 45%). See project site https://zt-yang.github.io/vlm-tamp-robot/ for more information.
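
The propose, refine, and replan loop described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `plan_with_replanning`, `toy_proposer`, `toy_refine`, the `max_queries` limit, and the toy kitchen subgoals are all hypothetical names invented here. The proposer stands in for the VLM and the refiner stands in for the task and motion planner; no images, geometry, or kinematics are modeled.

```python
from typing import Callable, List, Optional

Subgoal = str
Action = str


def plan_with_replanning(
    goal: str,
    propose_subgoals: Callable[[str, List[Subgoal], Optional[Subgoal]], List[Subgoal]],
    refine: Callable[[Subgoal], Optional[List[Action]]],
    max_queries: int = 3,  # hypothetical cap on proposer queries
) -> Optional[List[Action]]:
    """Refine proposed subgoals in order; re-query the proposer on failure."""
    plan: List[Action] = []
    achieved: List[Subgoal] = []
    failed: Optional[Subgoal] = None
    for _ in range(max_queries):
        # The proposer sees what has been achieved and what failed last time.
        subgoals = propose_subgoals(goal, achieved, failed)
        failed = None
        for subgoal in subgoals:
            actions = refine(subgoal)
            if actions is None:
                # The subgoal could not be refined: stop and ask again.
                failed = subgoal
                break
            plan.extend(actions)
            achieved.append(subgoal)
        if failed is None:
            return plan
    return None  # query budget exhausted


def toy_proposer(goal: str, achieved: List[Subgoal], failed: Optional[Subgoal]) -> List[Subgoal]:
    # Hypothetical stand-in for the VLM: switches burners after a failure.
    candidates = ["pot in hand", "pot on left burner"]
    if failed == "pot on left burner":
        candidates = ["pot in hand", "pot on right burner"]
    return [sg for sg in candidates if sg not in achieved]


def toy_refine(subgoal: Subgoal) -> Optional[List[Action]]:
    # Hypothetical stand-in for the planner: it inserts the prerequisite
    # "open drawer" action and reports the left burner as infeasible.
    table = {
        "pot in hand": ["open drawer", "pick pot"],
        "pot on left burner": None,
        "pot on right burner": ["place pot on right burner"],
    }
    return table.get(subgoal)


plan = plan_with_replanning("pot on a burner", toy_proposer, toy_refine)
print(plan)
```

In this toy run the first proposal fails at the left burner, the proposer is queried once more, and the returned plan includes the "open drawer" step that the proposer never mentioned, mirroring the division of labor the abstract describes.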

arXiv information

Authors	Zhutian Yang, Caelan Garrett, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling
Published	2024-10-03 04:14:21+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
