OpenVLA: An Open-Source Vision-Language-Action Model

Summary

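Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how robots are taught new skills. Instead of training each new behavior from scratch, such vision-language-action (VLA) models can be fine-tuned to obtain robust, generalizable policies for visuomotor control. Widespread adoption of VLAs in robotics has nevertheless been difficult for two reasons: 1) existing VLAs are largely closed and unavailable to the public, and 2) prior work has not explored how to fine-tune VLAs efficiently for new tasks, which is a key requirement for adoption. To address these challenges, the authors introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a result of the added data diversity and the new model components, OpenVLA shows strong results for generalist manipulation: it outperforms closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. The authors further show that OpenVLA can be fine-tuned effectively for new settings, with especially strong generalization in multi-task environments that involve multiple objects and require strong language grounding, and that it outperforms expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. They also examine compute efficiency. As a separate contribution, they show that OpenVLA can be fine-tuned on consumer GPUs with modern low-rank adaptation methods and served efficiently via quantization without reducing downstream success rate. Finally, they release model checkpoints, fine-tuning notebooks, and a PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.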

Abstract (original)

Large policies pretrained on a combination of Internet-scale vision-language data and diverse robot demonstrations have the potential to change how we teach robots new skills: rather than training new behaviors from scratch, we can fine-tune such vision-language-action (VLA) models to obtain robust, generalizable policies for visuomotor control. Yet, widespread adoption of VLAs for robotics has been challenging as 1) existing VLAs are largely closed and inaccessible to the public, and 2) prior work fails to explore methods for efficiently fine-tuning VLAs for new tasks, a key component for adoption. Addressing these challenges, we introduce OpenVLA, a 7B-parameter open-source VLA trained on a diverse collection of 970k real-world robot demonstrations. OpenVLA builds on a Llama 2 language model combined with a visual encoder that fuses pretrained features from DINOv2 and SigLIP. As a product of the added data diversity and new model components, OpenVLA demonstrates strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate across 29 tasks and multiple robot embodiments, with 7x fewer parameters. We further show that we can effectively fine-tune OpenVLA for new settings, with especially strong generalization results in multi-task environments involving multiple objects and strong language grounding abilities, and outperform expressive from-scratch imitation learning methods such as Diffusion Policy by 20.4%. We also explore compute efficiency; as a separate contribution, we show that OpenVLA can be fine-tuned on consumer GPUs via modern low-rank adaptation methods and served efficiently via quantization without a hit to downstream success rate. Finally, we release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
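
The abstract mentions fine-tuning on consumer GPUs via low-rank adaptation. As a rough illustration of why that is cheap, the following is a minimal sketch of the generic LoRA idea: the pretrained weight W stays frozen, and only a low-rank update (alpha / r) * B A is trained. It is not the OpenVLA codebase; the function names, the toy matrices, and the 4096x4096 layer size with rank 32 are all assumptions chosen for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / r) * (B @ A), the weight a LoRA-adapted layer applies.

    W: frozen pretrained weight, shape (d_out, d_in)
    A: trainable factor, shape (r, d_in)
    B: trainable factor, shape (d_out, r)
    """
    rank = len(A)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_trainable_params(d_out, d_in, rank):
    """Number of trainable entries in A and B combined."""
    return rank * (d_in + d_out)


# Tiny worked example (all numbers are illustrative, not from the paper).
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]   # frozen 2x3 weight
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # rank-2 factor, shape 2x3
B_init = [[0.0, 0.0], [0.0, 0.0]]        # zero init: the adapter starts as a no-op
B_trained = [[1.0, 0.0], [0.0, 1.0]]     # stand-in for values after training

unchanged = lora_effective_weight(W, A, B_init, alpha=4.0)
adapted = lora_effective_weight(W, A, B_trained, alpha=4.0)

# Parameter count for one hypothetical 4096x4096 projection at rank 32.
full = 4096 * 4096
low_rank = lora_trainable_params(4096, 4096, rank=32)
print(full, low_rank, full // low_rank)
```

Under these assumed sizes, the adapter trains 262,144 values in place of 16,777,216 for that one layer, a 64-fold reduction. Savings of this kind are what make fine-tuning a 7B-parameter model plausible on a single consumer GPU.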

arXiv information

Authors	Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
Published	2024-09-04 02:14:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
