CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

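Abstract

In recent years, Vision-Language-Action (VLA) models have become an important research direction in robotics because of their strong multimodal understanding and generalization capabilities.
Despite this progress, their practical deployment is severely constrained by inference-speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks.
Recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, but its practical benefit is marginal because it requires many iterations.
To address this, we introduce consistency distillation training so that the model predicts multiple correct action tokens in each iteration, thereby achieving acceleration.
In addition, we design mixed-label supervision to mitigate error accumulation during distillation.
Although distillation brings an acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck.
To tackle this, we propose an early-exit decoding strategy that moderately relaxes the convergence conditions, which further improves average inference efficiency.
Experimental results show that the proposed method achieves more than 4x inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks.
These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics.
The project page is available at https://irpn-eai.github.io/CEED-VLA/.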

Abstract (original)

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal due to the lengthy iterations. To address it, we introduce consistency distillation training to predict multiple correct action tokens in each iteration, thereby achieving acceleration. Besides, we design mixed-label supervision to mitigate the error accumulation during distillation. Although distillation brings acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
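
To make the decoding ideas in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a hypothetical deterministic function `toy_next_token` in place of a VLA model's greedy decoding step, and it models early exit simply as a cap on the number of Jacobi iterations (`max_iters`), which is only one assumed way a relaxed convergence condition could look. All function and parameter names are illustrative.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a model's greedy next-token choice."""
    return (sum(prefix) * 31 + len(prefix)) % 97


def autoregressive_decode(
    prompt: Sequence[int],
    n: int,
    step: Callable[[Sequence[int]], int] = toy_next_token,
) -> List[int]:
    """Baseline: generate n tokens one at a time, each conditioned on all earlier ones."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(step(seq))
    return seq[len(prompt):]


def jacobi_decode(
    prompt: Sequence[int],
    n: int,
    step: Callable[[Sequence[int]], int] = toy_next_token,
    max_iters: Optional[int] = None,
    init_token: int = 0,
) -> Tuple[List[int], int]:
    """Jacobi (fixed-point) decoding of a block of n tokens.

    Every position is recomputed from the previous iteration's guess, so a
    real model could update all n positions in a single parallel forward pass.
    max_iters is an assumed early-exit knob: stop after that many iterations
    even if the block has not reached a fixed point.
    """
    guess = [init_token] * n
    limit = n if max_iters is None else max_iters
    iters = 0
    for iters in range(1, limit + 1):
        # Position i sees the prompt plus the previous guess for positions < i.
        new = [step(list(prompt) + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point: identical to the autoregressive output
            break
        guess = new
    return guess, iters


if __name__ == "__main__":
    prompt = [3, 1, 4]
    reference = autoregressive_decode(prompt, 8)
    full, used = jacobi_decode(prompt, 8)
    early, used_early = jacobi_decode(prompt, 8, max_iters=3)
    print("autoregressive:", reference)
    print("jacobi        :", full, "iterations:", used)
    print("early exit    :", early, "iterations:", used_early)
```

In this toy setting, running Jacobi decoding for up to n iterations reproduces the autoregressive output exactly, and stopping after k iterations still guarantees that the first k tokens match it. Any practical speedup comes from evaluating all positions in one parallel forward pass and, per the abstract, from consistency distillation making several tokens correct per iteration; neither effect is modelled here.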

arXiv information

Authors	Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li
Published	2025-06-16 17:31:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
