Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success

Summary

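Recent vision-language-action models (VLAs) are built on pretrained vision-language models and use diverse robot datasets to demonstrate strong task execution, language-following ability, and semantic generalization.
Despite these successes, VLAs struggle with new robot setups and need fine-tuning to perform well, yet with so many possible strategies it is unclear how to fine-tune them most effectively.
In this work, we use OpenVLA as a representative base model to study key VLA adaptation design choices for fine-tuning, including action decoding schemes, action representations, and learning objectives.
Our empirical analysis leads to an Optimized Fine-Tuning (OFT) recipe that combines parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective, which together improve inference efficiency, policy performance, and flexibility in the model's input-output specifications.
We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, raising OpenVLA's average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$.
In real-world evaluations, the recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and to outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned with their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT), by up to 15% (absolute) in average success rate.
We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.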

Abstract (original)

Recent vision-language-action models (VLAs) build upon pretrained vision-language models and leverage diverse robot datasets to demonstrate strong task execution, language following ability, and semantic generalization. Despite these successes, VLAs struggle with novel robot setups and require fine-tuning to achieve good performance, yet how to most effectively fine-tune them is unclear given many possible strategies. In this work, we study key VLA adaptation design choices such as different action decoding schemes, action representations, and learning objectives for fine-tuning, using OpenVLA as our representative base model. Our empirical analysis informs an Optimized Fine-Tuning (OFT) recipe that integrates parallel decoding, action chunking, a continuous action representation, and a simple L1 regression-based learning objective to altogether improve inference efficiency, policy performance, and flexibility in the model’s input-output specifications. We propose OpenVLA-OFT, an instantiation of this recipe, which sets a new state of the art on the LIBERO simulation benchmark, significantly boosting OpenVLA’s average success rate across four task suites from 76.5% to 97.1% while increasing action generation throughput by 26$\times$. In real-world evaluations, our fine-tuning recipe enables OpenVLA to successfully execute dexterous, high-frequency control tasks on a bimanual ALOHA robot and outperform other VLAs ($\pi_0$ and RDT-1B) fine-tuned using their default recipes, as well as strong imitation learning policies trained from scratch (Diffusion Policy and ACT) by up to 15% (absolute) in average success rate. We release code for OFT and pretrained model checkpoints at https://openvla-oft.github.io/.
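The abstract names the OFT ingredients without showing how they fit together. The following is a minimal sketch of two of them, not the authors' implementation: an L1 regression loss over a chunk of continuous actions, and a count of decoder forward passes for autoregressive versus parallel decoding. The function names, the chunk length of 8, the 7-dimensional action, and the one-token-per-action-dimension assumption are illustrative choices made here, not values taken from the abstract.

```python
def l1_chunk_loss(predicted, target):
    """Mean absolute error over one action chunk.

    Both arguments are lists of per-timestep action vectors (lists of floats),
    i.e. a chunk of continuous actions with shape (chunk_len, action_dim).
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("chunks must be non-empty and of equal length")
    total, count = 0.0, 0
    for pred_step, target_step in zip(predicted, target):
        if len(pred_step) != len(target_step):
            raise ValueError("action dimensions must match")
        for p, t in zip(pred_step, target_step):
            total += abs(p - t)  # L1 term for one action dimension
            count += 1
    return total / count


def forward_passes(chunk_len, action_dim, parallel):
    """Decoder forward passes needed to emit one action chunk.

    Assumption: autoregressive decoding emits one token per action dimension,
    so it needs chunk_len * action_dim sequential passes. Parallel decoding
    predicts the whole chunk in a single pass.
    """
    return 1 if parallel else chunk_len * action_dim


if __name__ == "__main__":
    predicted = [[0.0, 1.0], [2.0, 3.0]]
    target = [[0.5, 1.0], [2.0, 2.0]]
    print(l1_chunk_loss(predicted, target))          # 0.375
    print(forward_passes(8, 7, parallel=False))      # 56
    print(forward_passes(8, 7, parallel=True))       # 1
```

The pass count is only a rough proxy for latency. It illustrates why parallel decoding combined with action chunking can raise throughput, but it does not reproduce the 26$\times$ figure reported in the abstract.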

arXiv information

Authors	Moo Jin Kim, Chelsea Finn, Percy Liang
Published	2025-04-28 07:49:39+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
