Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better

Abstract

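Vision-language-action (VLA) models offer a powerful approach to training control policies for physical systems such as robots, combining end-to-end learning with the transfer of semantic knowledge from web-scale vision-language model (VLM) training.
However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, which is an obstacle to real-time inference, and they operate on discrete tokens rather than the continuous-valued outputs needed to control robots.
To address this challenge, recent VLA models use specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically requires adding new, untrained parameters to the pretrained VLM backbone.
Although these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and how they affect VLA training dynamics.
This paper studies the question in the context of VLAs that include a continuous diffusion or flow-matching action expert, and shows that naively including such an expert significantly harms both training speed and knowledge transfer.
It provides an extensive analysis of various design choices and their impact on performance and knowledge transfer, and proposes a technique for insulating the VLM backbone during VLA training that mitigates this issue.
Videos are available at https://pi.website/research/knowledge_insulation.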

Abstract (original)

Vision-language-action (VLA) models provide a powerful approach to training control policies for physical systems, such as robots, by combining end-to-end learning with transfer of semantic knowledge from web-scale vision-language model (VLM) training. However, the constraints of real-time control are often at odds with the design of VLMs: the most powerful VLMs have tens or hundreds of billions of parameters, presenting an obstacle to real-time inference, and operate on discrete tokens rather than the continuous-valued outputs that are required for controlling robots. To address this challenge, recent VLA models have used specialized modules for efficient continuous control, such as action experts or continuous output heads, which typically require adding new untrained parameters to the pretrained VLM backbone. While these modules improve real-time and control capabilities, it remains an open question whether they preserve or degrade the semantic knowledge contained in the pretrained VLM, and what effect they have on the VLA training dynamics. In this paper, we study this question in the context of VLAs that include a continuous diffusion or flow matching action expert, showing that naively including such experts significantly harms both training speed and knowledge transfer. We provide an extensive analysis of various design choices, their impact on performance and knowledge transfer, and propose a technique for insulating the VLM backbone during VLA training that mitigates this issue. Videos are available at https://pi.website/research/knowledge_insulation.
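
The abstract does not spell out how the backbone is insulated. As a minimal sketch, assuming that "insulating" means blocking the action expert's loss gradient from reaching the backbone parameters, the toy scalar model below shows the difference between the naive and the insulated case. The function name, the scalar weights, and the squared-error loss are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
# Toy illustration only: a scalar "backbone" feeding a scalar "action expert".
# All names and the squared-error loss are assumptions, not the paper's method.

def action_loss_grads(w_backbone, w_expert, x, a_target, insulate):
    """Return (dL/dw_backbone, dL/dw_expert) for L = (a - a_target)**2.

    h = w_backbone * x   is the backbone feature,
    a = w_expert * h     is the action expert's prediction.
    With insulate=True, h is treated as a constant by the action loss,
    so no gradient from that loss reaches the backbone weight.
    """
    h = w_backbone * x
    a = w_expert * h
    dL_da = 2.0 * (a - a_target)

    # The expert's own weight always receives the action-loss gradient.
    g_expert = dL_da * h

    # Chain rule through h gives dL/da * w_expert * x; insulation cuts it off.
    g_backbone = 0.0 if insulate else dL_da * w_expert * x
    return g_backbone, g_expert


naive = action_loss_grads(1.0, 0.5, 2.0, 3.0, insulate=False)
insulated = action_loss_grads(1.0, 0.5, 2.0, 3.0, insulate=True)
print(naive)      # (-4.0, -8.0)
print(insulated)  # (0.0, -8.0)
```

In the insulated case the expert still learns from the action loss, while the backbone weight is left untouched by it. Any objective that would train the backbone itself is omitted from this sketch.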

arXiv information

Authors	Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, Lucy Xiaoyang Shi, Sergey Levine
Published	2025-05-29 17:40:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
