Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints

Summary

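Foundation models trained on internet-scale data, such as vision-language models (VLMs), excel at tasks that call for common sense, such as visual question answering.
Despite these capabilities, such models cannot yet be applied directly to hard robot manipulation problems that demand complex, precise continuous reasoning.
Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons by composing traditional primitive robot operations.
However, these systems require a detailed model of how the robot can affect its environment, which prevents them from directly interpreting and addressing novel human objectives such as an arbitrary natural-language goal.
We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that let TAMP reason about open-world concepts.
Specifically, we propose algorithms for VLM partial planning, which constrains the TAMP system's discrete temporal search, and for VLM continuous constraint interpretation, which augments the traditional manipulation constraints that TAMP systems seek to satisfy.
We demonstrate the approach on two robot embodiments, including a real-world robot, across several manipulation tasks in which the desired objective is conveyed solely through language.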

Abstract (original)

Foundation models trained on internet-scale data, such as Vision-Language Models (VLMs), excel at performing tasks involving common sense, such as visual question answering. Despite their impressive capabilities, these models cannot currently be directly applied to challenging robot manipulation problems that require complex and precise continuous reasoning. Task and Motion Planning (TAMP) systems can control high-dimensional continuous systems over long horizons through combining traditional primitive robot operations. However, these systems require a detailed model of how the robot can impact its environment, preventing them from directly interpreting and addressing novel human objectives, for example, an arbitrary natural language goal. We propose deploying VLMs within TAMP systems by having them generate discrete and continuous language-parameterized constraints that enable TAMP to reason about open-world concepts. Specifically, we propose algorithms for VLM partial planning that constrain a TAMP system’s discrete temporal search and VLM continuous constraints interpretation to augment the traditional manipulation constraints that TAMP systems seek to satisfy. We demonstrate our approach on two robot embodiments, including a real-world robot, across several manipulation tasks, where the desired objectives are conveyed solely through language.

arXiv information

Authors	Nishanth Kumar, Fabio Ramos, Dieter Fox, Caelan Reed Garrett
Published	2024-11-13 00:02:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
