Latent Action Pretraining from Videos

Summary

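We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
Existing VLA models typically require action labels collected by human teleoperators during pretraining, which significantly limits the available data sources and scale.
In this work, we propose a method for learning from internet-scale videos that have no robot action labels.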
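First, we train an action quantization model with a VQ-VAE-based objective to learn discrete latent actions between image frames. Next, we pretrain a latent VLA model to predict these latent actions from observations and task descriptions. Finally, we finetune the VLA on small-scale robot manipulation data to map latent actions to robot actions.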
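Experimental results show that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos.
Furthermore, it outperforms the state-of-the-art VLA model trained with robot action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions.
Training on human manipulation videos alone also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation models.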
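
The first stage described above (learning discrete latent actions between image frames) relies on vector quantization: a continuous embedding is snapped to its nearest entry in a codebook, and the index of that entry serves as the latent action. The following is only a minimal sketch of that lookup step, not the authors' implementation. The encoder is replaced here by a toy stand-in (the element-wise difference of two feature vectors), and the names `quantize`, `latent_action`, and `CODEBOOK`, along with the codebook values, are assumptions made for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to z (the VQ lookup)."""
    return min(range(len(codebook)), key=lambda k: squared_distance(z, codebook[k]))


def latent_action(
    frame_t: Sequence[float],
    frame_next: Sequence[float],
    codebook: List[Sequence[float]],
) -> int:
    """Map a pair of frame feature vectors to a discrete latent action index."""
    # Toy stand-in for a learned encoder (an assumption of this sketch):
    # the continuous embedding is the difference between frame features.
    delta = [b - a for a, b in zip(frame_t, frame_next)]
    return quantize(delta, codebook)


# Hypothetical 2-D codebook with four entries (values chosen for illustration).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]

if __name__ == "__main__":
    # The features move by roughly (+1, 0), so the nearest code is index 1.
    print(latent_action((0.2, 0.5), (1.1, 0.6), CODEBOOK))
```

In a real VQ-VAE the encoder, decoder, and codebook are all learned jointly; the sketch shows only how a continuous change between frames becomes a discrete token that a latent VLA model could then be trained to predict.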

Abstract (original)

We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.

arXiv information

Authors	Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
Published	2024-10-15 16:28:09+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
