ROSA: Harnessing Robot States for Vision-Language and Action Alignment

Summary

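Thanks to the strong generalization capabilities of vision-language models (VLMs), vision-language-action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control.
A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space.
Existing approaches typically rely on directly fine-tuning VLMs with expert demonstrations.
However, this strategy suffers from a spatio-temporal gap, which leads to considerable data inefficiency and heavy reliance on human labor.
Spatially, VLMs operate in a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space.
Temporally, VLMs primarily interpret the present, whereas VLA models anticipate future actions.
To overcome these challenges, we propose ROSA, a novel training paradigm that leverages robot state estimation to improve alignment between the vision-language and action spaces.
By integrating robot state estimation data obtained through an automated process, ROSA gives the VLA model enhanced spatial understanding and self-awareness, thereby boosting performance and generalization.
Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.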

Abstract (original)

Vision-Language-Action (VLA) models have recently made significant advances in multi-task, end-to-end robotic control, due to the strong generalization capabilities of Vision-Language Models (VLMs). A fundamental challenge in developing such models is effectively aligning the vision-language space with the robotic action space. Existing approaches typically rely on directly fine-tuning VLMs using expert demonstrations. However, this strategy suffers from a spatio-temporal gap, resulting in considerable data inefficiency and heavy reliance on human labor. Spatially, VLMs operate within a high-level semantic space, whereas robotic actions are grounded in low-level 3D physical space; temporally, VLMs primarily interpret the present, while VLA models anticipate future actions. To overcome these challenges, we propose a novel training paradigm, ROSA, which leverages robot state estimation to improve alignment between vision-language and action spaces. By integrating robot state estimation data obtained via an automated process, ROSA enables the VLA model to gain enhanced spatial understanding and self-awareness, thereby boosting performance and generalization. Extensive experiments in both simulated and real-world environments demonstrate the effectiveness of ROSA, particularly in low-data regimes.

arXiv information

Authors	Yuqing Wen, Kefan Gu, Haoxuan Liu, Yucheng Zhao, Tiancai Wang, Haoqiang Fan, Xiaoyan Sun
Published	2025-06-16 16:34:20+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
