3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks

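Abstract

Robotic manipulation in 3D requires learning a joint-space trajectory for a robot manipulator with $N$ degrees of freedom.
Robots need semantic and visual perception abilities to translate real-world mappings of their workspace into the low-level control required for object manipulation.
Recent work has demonstrated that large Vision-Language Models (VLMs) can be fine-tuned to learn the mapping between RGB images, language instructions, and joint-space control.
These models typically take RGB images of the workspace and language instructions as input, and are trained on large datasets of teleoperated robot demonstrations.
In this work, we explore ways to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region-of-interest detection.
Experiments in the LIBERO simulation environment show that the proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1$\%$.
We also evaluate the zero-shot capabilities of our method and show that 3D scene awareness leads to robust learning and adaptation on completely unseen tasks.
3D-CAVLA achieves an absolute improvement of 8.8$\%$ on unseen tasks.
We will open-source our code and the unseen-tasks dataset to promote community-driven research: https://3d-cavla.github.io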

Abstract (original)

Robotic manipulation in 3D requires learning an $N$ degree-of-freedom joint space trajectory of a robot manipulator. Robots must possess semantic and visual perception abilities to transform real-world mappings of their workspace into the low-level control necessary for object manipulation. Recent work has demonstrated the capabilities of fine-tuning large Vision-Language Models (VLMs) to learn the mapping between RGB images, language instructions, and joint space control. These models typically take as input RGB images of the workspace and language instructions, and are trained on large datasets of teleoperated robot demonstrations. In this work, we explore methods to improve the scene context awareness of a popular recent Vision-Language-Action model by integrating chain-of-thought reasoning, depth perception, and task-oriented region of interest detection. Our experiments in the LIBERO simulation environment show that our proposed model, 3D-CAVLA, improves the success rate across various LIBERO task suites, achieving an average success rate of 98.1$\%$. We also evaluate the zero-shot capabilities of our method, demonstrating that 3D scene awareness leads to robust learning and adaptation for completely unseen tasks. 3D-CAVLA achieves an absolute improvement of 8.8$\%$ on unseen tasks. We will open-source our code and the unseen tasks dataset to promote community-driven research here: https://3d-cavla.github.io
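To make the reported metrics concrete, the sketch below shows how a success rate and an absolute (percentage-point) improvement are commonly computed from per-episode outcomes. The function names and the episode outcomes are hypothetical illustrations, not the authors' evaluation code or data.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of successful episodes."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def absolute_improvement(baseline_pct: float, model_pct: float) -> float:
    """Difference in percentage points (not a relative change)."""
    return model_pct - baseline_pct


# Hypothetical outcomes for illustration only; not results from the paper.
baseline = [True] * 6 + [False] * 4   # 6 of 10 episodes succeed
improved = [True] * 7 + [False] * 3   # 7 of 10 episodes succeed

gain = absolute_improvement(success_rate(baseline), success_rate(improved))
print(f"absolute improvement: {gain:.1f} percentage points")
```

Read this way, an absolute improvement such as the 8.8$\%$ quoted for unseen tasks would be the gap in percentage points between two success rates, not a ratio of one to the other.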

arXiv information

Authors	Vineet Bhat, Yu-Hsiang Lan, Prashanth Krishnamurthy, Ramesh Karri, Farshad Khorrami
Published	2025-05-09 05:32:40+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
