KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation

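Summary

Rapid advances in large language models (LLMs) and vision-language models (VLMs) have driven significant progress in open-vocabulary robotic manipulation systems.
However, many existing approaches overlook the importance of object dynamics, which limits their applicability to more complex, dynamic tasks.
This work introduces KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models.
The key insight is that a keypoint-based target specification is interpretable by VLMs and, at the same time, can be efficiently translated into cost functions for model-based planning.
Given a language instruction and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications.
These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robot trajectories.
KUDA is evaluated on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of the framework.
The project page is available at http://kuda-dynamics.github.io.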

Abstract (original)

With the rapid advancement of large language models (LLMs) and vision-language models (VLMs), significant progress has been made in developing open-vocabulary robotic manipulation systems. However, many existing approaches overlook the importance of object dynamics, limiting their applicability to more complex, dynamic tasks. In this work, we introduce KUDA, an open-vocabulary manipulation system that integrates dynamics learning and visual prompting through keypoints, leveraging both VLMs and learning-based neural dynamics models. Our key insight is that a keypoint-based target specification is simultaneously interpretable by VLMs and can be efficiently translated into cost functions for model-based planning. Given language instructions and visual observations, KUDA first assigns keypoints to the RGB image and queries the VLM to generate target specifications. These abstract keypoint-based representations are then converted into cost functions, which are optimized using a learned dynamics model to produce robotic trajectories. We evaluate KUDA on a range of manipulation tasks, including free-form language instructions across diverse object categories, multi-object interactions, and deformable or granular objects, demonstrating the effectiveness of our framework. The project page is available at http://kuda-dynamics.github.io.
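
The step from keypoint targets to a planning cost can be illustrated with a minimal sketch. It is not the authors' implementation: the squared-distance cost, the `toy_dynamics` function (a hypothetical stand-in for a learned neural dynamics model that simply translates every keypoint by the action), and the random-shooting planner are all assumptions chosen to keep the example self-contained.

```python
import random


def keypoint_cost(keypoints, targets):
    """Sum of squared distances between selected keypoints and their targets.

    keypoints: list of (x, y) positions.
    targets: dict mapping a keypoint index to its target (x, y) position,
             standing in for a target specification returned by a VLM.
    """
    return sum(
        (keypoints[i][0] - tx) ** 2 + (keypoints[i][1] - ty) ** 2
        for i, (tx, ty) in targets.items()
    )


def toy_dynamics(keypoints, action):
    """Hypothetical dynamics model: the action (dx, dy) shifts every keypoint.

    A real system would use a learned model here; this is an assumption.
    """
    dx, dy = action
    return [(x + dx, y + dy) for x, y in keypoints]


def plan_random_shooting(keypoints, targets, dynamics,
                         horizon=5, samples=200, step=0.1, seed=0):
    """Sample random action sequences, roll each out through the dynamics
    model, and keep the sequence with the lowest final keypoint cost."""
    rng = random.Random(seed)
    best_cost, best_actions = float("inf"), []
    for _ in range(samples):
        actions = [
            (rng.uniform(-step, step), rng.uniform(-step, step))
            for _ in range(horizon)
        ]
        state = keypoints
        for action in actions:
            state = dynamics(state, action)
        cost = keypoint_cost(state, targets)
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions, best_cost


if __name__ == "__main__":
    start = [(0.0, 0.0), (1.0, 0.0)]
    # Hypothetical target: move keypoint 0 to (0.3, 0.2).
    goal = {0: (0.3, 0.2)}
    actions, cost = plan_random_shooting(start, goal, toy_dynamics)
    print("initial cost:", keypoint_cost(start, goal))
    print("planned cost:", cost)
```

Because the cost only reads keypoint positions, the same function works for any dynamics model that maps keypoints and an action to new keypoints, which is the property the abstract relies on.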

arXiv information

Authors	Zixian Liu, Mingtong Zhang, Yunzhu Li
Published	2025-03-13 16:59:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
