Latent Action Pretraining from Videos

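Summary

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
Existing VLA models require action labels during pretraining, typically collected by human teleoperators, which significantly limits the possible data sources and scale.
In this work, we propose a method for learning from internet-scale videos that have no robot action labels.
The method has three stages: first, an action quantization model is trained with a VQ-VAE-based objective to learn discrete latent actions between image frames; next, a latent VLA model is pretrained to predict these latent actions from observations and task descriptions; finally, the VLA is fine-tuned on small-scale robot manipulation data to map latent actions to robot actions.
Experimental results show that this method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos.
It also outperforms the state-of-the-art VLA model trained with robot action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions.
Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.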
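
The first stage described above, assigning a discrete latent action to a pair of consecutive frames, can be illustrated with a nearest-neighbor codebook lookup, which is the quantization step at the core of a VQ-VAE. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy two-dimensional frame features, the fixed `CODEBOOK`, and the use of a raw feature difference in place of a learned encoder are all hypothetical.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_latent_action(
    frame_t: Sequence[float],
    frame_next: Sequence[float],
    codebook: List[Sequence[float]],
) -> int:
    """Return the index of the codebook entry nearest to the frame delta.

    Assumption: the continuous latent is the raw feature difference.
    In a real VQ-VAE it would be produced by a learned encoder, and the
    codebook would be learned jointly rather than fixed by hand.
    """
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    distances = [squared_distance(delta, code) for code in codebook]
    # Index of the smallest distance is the discrete latent action id.
    return min(range(len(codebook)), key=distances.__getitem__)


# Hypothetical toy codebook with three discrete latent actions.
CODEBOOK = [
    (0.0, 0.0),  # 0: no change between frames
    (1.0, 0.0),  # 1: shift along the first feature
    (0.0, 1.0),  # 2: shift along the second feature
]

latent_id = quantize_latent_action((0.2, 0.5), (1.1, 0.6), CODEBOOK)
print(latent_id)
```

In this sketch the integer returned plays the role of the discrete latent action label that the second stage would train the VLA to predict from an observation and a task description.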

Abstract (original)

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.

arXiv information

Authors	Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
Published	2025-05-15 12:13:37+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
