Task-oriented Robotic Manipulation with Vision Language Models

Abstract

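Vision Language Models (VLMs) play a crucial role in robotic manipulation: they let robots interpret the visual properties of objects and their surroundings and act on that multimodal understanding.
Accurately understanding spatial relationships remains a non-trivial challenge, yet it is essential for effective robotic manipulation.
In this work, we introduce a novel framework that integrates VLMs with a structured spatial reasoning pipeline to perform object manipulation from high-level, task-oriented input.
Our approach transforms visual scenes into tree-structured representations that encode spatial relations.
A Large Language Model (LLM) then processes these trees to infer restructured configurations that specify how the objects should be organized for a given high-level task.
To support the framework, we also present a new dataset of manually annotated captions that describe spatial relations among objects, together with object-level attribute annotations such as fragility, mass, material, and transparency.
We demonstrate that our method not only improves the understanding of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively.
As a result, the approach significantly enhances spatial reasoning in robotic manipulation tasks.
To our knowledge, this is the first method of its kind in the literature, offering a novel solution that lets robots organize and use the objects in their surroundings more efficiently.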
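
The abstract does not specify how the scene trees are laid out, so the following is only a minimal sketch of one way such a tree could be built and flattened into text for an LLM prompt. The class `SceneNode`, its field names, the relation labels ("on"), and the `serialize` helper are all hypothetical and are not taken from the paper; only the attribute categories (fragility, mass, material, transparency) come from the abstract.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SceneNode:
    """One object in the scene. All field names are illustrative assumptions."""
    name: str
    relation: str = "root"            # spatial relation to the parent node
    fragility: str = "unknown"
    mass_kg: Optional[float] = None
    material: str = "unknown"
    transparent: bool = False
    children: List[SceneNode] = field(default_factory=list)

    def add(self, child: SceneNode) -> SceneNode:
        """Attach a child object and return it so calls can be chained."""
        self.children.append(child)
        return child


def serialize(node: SceneNode, depth: int = 0) -> str:
    """Render the tree as indented text that could be placed in an LLM prompt."""
    attrs = f"material={node.material}, fragility={node.fragility}"
    line = f"{'  ' * depth}{node.relation}: {node.name} ({attrs})"
    return "\n".join([line] + [serialize(c, depth + 1) for c in node.children])


# Hypothetical scene: a glass on a plate on a table, with a book beside them.
table = SceneNode("table", material="wood")
plate = table.add(SceneNode("plate", relation="on", material="ceramic", fragility="high"))
plate.add(SceneNode("glass", relation="on", material="glass", fragility="high", transparent=True))
table.add(SceneNode("book", relation="on", material="paper", fragility="low"))

print(serialize(table))
```

In this sketch, nesting depth carries the spatial relation chain (the glass is on the plate, which is on the table), and the serialized text is what a downstream LLM would be asked to rearrange for a given task.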

Abstract (original)

Vision Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. Accurately understanding spatial relationships remains a non-trivial challenge, yet it is essential for effective robotic manipulation. In this work, we introduce a novel framework that integrates VLMs with a structured spatial reasoning pipeline to perform object manipulation based on high-level, task-oriented input. Our approach is the transformation of visual scenes into tree-structured representations that encode the spatial relations. These trees are subsequently processed by a Large Language Model (LLM) to infer restructured configurations that determine how these objects should be organised for a given high-level task. To support our framework, we also present a new dataset containing manually annotated captions that describe spatial relations among objects, along with object-level attribute annotations such as fragility, mass, material, and transparency. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.

arXiv information

Authors	Nurhan Bulus Guran, Hanchi Ren, Jingjing Deng, Xianghua Xie
Published	2025-05-20 09:42:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
