A Joint Modeling of Vision-Language-Action for Target-oriented Grasping in Clutter

Summary

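We focus on language-conditioned grasping in clutter, a task in which a robot must grasp a target object according to a language instruction.
Previous work performs visual grounding to localize the target object and generates a grasp for that object as a separate step.
However, these approaches require object labels or visual attributes for grounding, which calls for handcrafted rules in the planner and restricts the range of language instructions.
In this paper, we propose to jointly model vision, language, and action using an object-centric representation.
Our method applies under more flexible language instructions and is not limited by visual grounding errors.
In addition, by exploiting the strong priors of a pre-trained multi-modal model and a grasp model, it improves sample efficiency and alleviates the sim2real problem without additional data for transfer.
Experiments in simulation and in the real world indicate that our method achieves a higher task success rate with fewer motions under more flexible language instructions.
Moreover, our method generalizes better to scenarios with unseen objects and language instructions.
Our code is available at https://github.com/xukechun/Vision-Language-Grasping.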

Summary (original)

We focus on the task of language-conditioned grasping in clutter, in which a robot is supposed to grasp the target object based on a language instruction. Previous works separately conduct visual grounding to localize the target object, and generate a grasp for that object. However, these works require object labels or visual attributes for grounding, which calls for handcrafted rules in planner and restricts the range of language instructions. In this paper, we propose to jointly model vision, language and action with object-centric representation. Our method is applicable under more flexible language instructions, and not limited by visual grounding error. Besides, by utilizing the powerful priors from the pre-trained multi-modal model and grasp model, sample efficiency is effectively improved and the sim2real problem is relieved without additional data for transfer. A series of experiments carried out in simulation and real world indicate that our method can achieve better task success rate by less times of motion under more flexible language instructions. Moreover, our method is capable of generalizing better to scenarios with unseen objects and language instructions. Our code is available at https://github.com/xukechun/Vision-Language-Grasping

arXiv information

Authors	Kechun Xu, Shuqi Zhao, Zhongxiang Zhou, Zizhang Li, Huaijin Pi, Yue Wang, Rong Xiong
Published	2024-10-31 17:22:32+00:00
arXiv page	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
