UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Abstract

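A generalist robot should perform effectively across a variety of environments.
However, most existing approaches rely heavily on scaling up action-annotated data to improve their capabilities.
As a result, they are often limited to a single physical specification and struggle to learn knowledge that transfers across different embodiments and environments.
To address these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies.
Our key innovation is to derive task-centric action representations from videos using a latent action model.
This lets us exploit extensive data across a wide range of embodiments and viewpoints.
To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and build the latent action model within the DINO feature space.
Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding.
We obtain state-of-the-art results on multiple manipulation and navigation benchmarks, as well as in real-robot deployments.
UniVLA outperforms OpenVLA while using less than 1/20 of the pretraining compute and 1/10 of the downstream data.
Performance continues to improve as heterogeneous data, even including human videos, is incorporated into the training pipeline.
These results underscore UniVLA's potential to enable scalable and efficient robot policy learning.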

Abstract (original)

A generalist robot should perform effectively across various environments. However, most existing approaches heavily rely on scaling action-annotated data to enhance their capabilities. Consequently, they are often limited to single physical specification and struggle to learn transferable knowledge across different embodiments and environments. To confront these limitations, we propose UniVLA, a new framework for learning cross-embodiment vision-language-action (VLA) policies. Our key innovation is to derive task-centric action representations from videos with a latent action model. This enables us to exploit extensive data across a wide spectrum of embodiments and perspectives. To mitigate the effect of task-irrelevant dynamics, we incorporate language instructions and establish a latent action model within the DINO feature space. Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding. We obtain state-of-the-art results across multiple manipulation and navigation benchmarks, as well as real-robot deployments. UniVLA achieves superior performance over OpenVLA with less than 1/20 of pretraining compute and 1/10 of downstream data. Continuous performance improvements are observed as heterogeneous data, even including human videos, are incorporated into the training pipeline. The results underscore UniVLA’s potential to facilitate scalable and efficient robot policy learning.

arXiv information

Authors	Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li
Published	2025-05-09 15:11:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
