VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation

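Summary

Vision-Language-Action (VLA) models, thanks to their strong multimodal reasoning capabilities, can process instructions and visual perception to generate actions directly as output in an end-to-end manner. Although the performance of VLA models is promising, their computational cost is considerable. This poses a challenge when applying VLA models to robotic tasks that demand real-time decision-making to respond quickly to environmental changes. Because robot control involves sequential decision-making, the visual input often changes only minimally between consecutive steps. A natural idea is to reuse the computed results for visual tokens that have not changed since the previous step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step and adaptively identifies visual tokens with little change. The computed results for these unchanged tokens are reused in subsequent steps via the KV-cache, significantly improving the efficiency of the VLA-Cache model. Experiments in both simulation (e.g., the LIBERO benchmark and SIMPLER) and on a real-world robot demonstrate that VLA-Cache achieves practical acceleration with minimal sacrifice in success rate.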

Summary (original)

Vision-Language-Action (VLA) models can process instructions and visual perception to directly generate actions as output in an end-to-end fashion due to their strong multi-modal reasoning capabilities. While the performance of VLA models is promising, their computational cost can be substantial. This raises a challenge for applying them to robotics tasks, which require real-time decision-making to respond quickly to environmental changes. Since robotic control involves sequential decision-making, the visual input often exhibits minimal variation between successive steps. A natural idea is to reuse the computational results of unchanged visual tokens from the last step. Motivated by this idea, we propose VLA-Cache, an efficient vision-language-action model. VLA-Cache incorporates a token-selection mechanism that compares the visual input at each step with the input from the previous step, adaptively identifying visual tokens with minimal changes. The computational results for these unchanged tokens are then reused in subsequent steps via KV-cache, thereby significantly improving the efficiency of the VLA-Cache model. Experimental results on both simulation (e.g., LIBERO benchmark and SIMPLER) and a real-world robot validate that VLA-Cache can achieve practical acceleration with minimal sacrifice in success rate.
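
The token-selection idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes grayscale frames stored as nested lists, a mean absolute pixel difference as the change measure, and a plain dictionary standing in for the KV-cache. All function names, the `patch` size, and the `tol` threshold are hypothetical choices for illustration only.

```python
def patch_vectors(frame, patch):
    """Split a 2D grayscale frame (list of rows) into flattened patch vectors."""
    height, width = len(frame), len(frame[0])
    vectors = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            vectors.append([frame[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return vectors


def select_static_tokens(prev_frame, curr_frame, patch=2, tol=1.0):
    """Return indices of patches whose mean absolute change is at most tol.

    Assumption: a simple pixel-difference test stands in for whatever
    similarity measure the real method uses.
    """
    prev = patch_vectors(prev_frame, patch)
    curr = patch_vectors(curr_frame, patch)
    static = []
    for i, (p, c) in enumerate(zip(prev, curr)):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(p, c)) / len(p)
        if mean_abs_diff <= tol:
            static.append(i)
    return static


def step_with_cache(curr_frame, prev_frame, cache, compute, patch=2, tol=1.0):
    """Recompute only changed tokens and reuse cached results for static ones.

    `cache` is a plain dict (token index -> result), a toy stand-in for a
    KV-cache; `compute` is any per-token function supplied by the caller.
    Returns the per-token results and the number of tokens recomputed.
    """
    tokens = patch_vectors(curr_frame, patch)
    static = set()
    if prev_frame is not None:
        static = set(select_static_tokens(prev_frame, curr_frame, patch, tol))
    results, recomputed = [], 0
    for i, token in enumerate(tokens):
        if i in static and i in cache:
            results.append(cache[i])        # reuse the earlier result
        else:
            cache[i] = compute(token)       # token changed: recompute
            results.append(cache[i])
            recomputed += 1
    return results, recomputed
```

In a real transformer the cached keys and values of one token also depend on its interactions with other tokens, so reuse is an approximation there. The sketch only shows the bookkeeping: compare consecutive frames, mark near-identical patches, and skip recomputation for them.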

arXiv information

Authors	Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu
Published	2025-02-04 09:48:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
