Task-oriented Robotic Manipulation with Vision Language Models

要約

視覚言語モデル (VLM) は、ロボットが物体とその周囲の視覚的特性を理解し解釈できるようにすることで、ロボット操作において重要な役割を果たし、このマルチモーダルな理解に基づいて操作を実行できるようにします。
ただし、オブジェクトの属性と空間関係を理解することは簡単な作業ではありませんが、ロボット操作タスクでは重要です。
この研究では、空間関係と属性割り当てに焦点を当てた新しいデータセットと、VLM を利用してタスク指向の高レベル入力によるオブジェクト操作を実行する新しい方法を紹介します。
このデータセットでは、オブジェクト間の空間関係がキャプションとして手動で記述されます。
さらに、各オブジェクトには、微調整された視覚言語モデルから派生した、脆弱性、質量、材質、透明度などの複数の属性が付けられます。
キャプションから埋め込まれたオブジェクト情報は自動的に抽出され、各画像内のオブジェクト間の空間関係を捉えるデータ構造 (この場合はデモンストレーション用のツリー) に変換されます。
次に、ツリー構造はオブジェクト属性とともに言語モデルに入力され、特定の (高レベルの) タスクを達成するためにこれらのオブジェクトをどのように編成するかを決定する新しいツリー構造に変換されます。
私たちは、私たちの方法が視覚環境内のオブジェクト間の空間関係の理解を改善するだけでなく、ロボットがこれらのオブジェクトとより効果的に対話できるようにすることを実証します。
結果として、このアプローチはロボット操作タスクにおける空間推論を大幅に強化します。
私たちの知る限り、これは文献にあるこの種の最初の方法であり、ロボットが周囲の物体をより効率的に整理して利用できるようにする新しいソリューションを提供します。

要約(オリジナル)

Vision-Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. However, understanding object attributes and spatial relationships is a non-trivial task but is critical in robotic manipulation tasks. In this work, we present a new dataset focused on spatial relationships and attribute assignment and a novel method to utilize VLMs to perform object manipulation with task-oriented, high-level input. In this dataset, the spatial relationships between objects are manually described as captions. Additionally, each object is labeled with multiple attributes, such as fragility, mass, material, and transparency, derived from a fine-tuned vision language model. The embedded object information from captions are automatically extracted and transformed into a data structure (in this case, tree, for demonstration purposes) that captures the spatial relationships among the objects within each image. The tree structures, along with the object attributes, are then fed into a language model to transform into a new tree structure that determines how these objects should be organized in order to accomplish a specific (high-level) task. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.

arxiv情報

著者	Nurhan Bulus Guran,Hanchi Ren,Jingjing Deng,Xianghua Xie
発行日	2024-10-21 10:43:49+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Task-oriented Robotic Manipulation with Vision Language Models

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー