MoLe-VLA: Dynamic Layer-skipping Vision Language Action Model via Mixture-of-Layers for Efficient Robot Manipulation

Abstract

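Multimodal Large Language Models (MLLMs) are good at understanding complex language and visual data, which lets generalist robotic systems interpret instructions and carry out embodied tasks.
Their real-world deployment, however, is held back by substantial compute and storage requirements.
Recent observations that LLM layers show homogeneous patterns have inspired sparsification techniques, such as early exit and token pruning, to address these costs.
These methods, however, often overlook the critical role of the final layers, which encode the semantic information most relevant to downstream robotic tasks.
In line with the recent Shallow Brain Hypothesis (SBH) in neuroscience and with mixture-of-experts approaches to model sparsification, the authors treat each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation.
They introduce a Spatial-Temporal Aware Router (STAR) for MoLe that selectively activates only a subset of the layers based on the robot's current state, mimicking the brain's distinct signal pathways specialized for cognition and causal reasoning.
To compensate for the LLM cognitive ability that is lost in MoLe, they also devise a Cognition Self-Knowledge Distillation (CogKD) framework.
CogKD improves the understanding of task demands and, by leveraging cognitive features, the generation of task-relevant action sequences.
Extensive experiments in both RLBench simulation and real-world environments show that MoLe-VLA is superior in both efficiency and performance.
Specifically, compared with standard LLMs, MoLe-VLA improves the mean success rate across ten tasks by 8% while reducing computational cost by up to 5.6x.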

Abstract (original)

Multimodal Large Language Models (MLLMs) excel in understanding complex language and visual data, enabling generalist robotic systems to interpret instructions and perform embodied tasks. Nevertheless, their real-world deployment is hindered by substantial computational and storage demands. Recent insights into the homogeneous patterns in the LLM layer have inspired sparsification techniques to address these challenges, such as early exit and token pruning. However, these methods often neglect the critical role of the final layers that encode the semantic information most relevant to downstream robotic tasks. Aligning with the recent breakthrough of the Shallow Brain Hypothesis (SBH) in neuroscience and the mixture of experts in model sparsification, we conceptualize each LLM layer as an expert and propose a Mixture-of-Layers Vision-Language-Action model (MoLe-VLA, or simply MoLe) architecture for dynamic LLM layer activation. We introduce a Spatial-Temporal Aware Router (STAR) for MoLe to selectively activate only parts of the layers based on the robot’s current state, mimicking the brain’s distinct signal pathways specialized for cognition and causal reasoning. Additionally, to compensate for the cognitive ability of LLMs lost in MoLe, we devise a Cognition Self-Knowledge Distillation (CogKD) framework. CogKD enhances the understanding of task demands and improves the generation of task-relevant action sequences by leveraging cognitive features. Extensive experiments conducted in both RLBench simulation and real-world environments demonstrate the superiority of MoLe-VLA in both efficiency and performance. Specifically, MoLe-VLA achieves an 8% improvement in the mean success rate across ten tasks while reducing computational costs by up to x5.6 compared to standard LLMs.
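The idea of treating each LLM layer as an expert and activating only some of them per input can be illustrated with a small toy. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how STAR scores layers, so the linear scorer, the top-k rule, the toy residual layers, and all names here (`make_layer`, `route_top_k`, `forward_mixture_of_layers`) are hypothetical.

```python
import math
import random


def make_layer(dim, seed):
    """Build a toy residual layer x -> x + tanh(W x).

    Assumption: this stands in for a full transformer block.
    """
    rng = random.Random(seed)
    w = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def layer(x):
        return [
            xi + math.tanh(sum(wij * xj for wij, xj in zip(row, x)))
            for xi, row in zip(x, w)
        ]

    return layer


def route_top_k(state, router_weights, k):
    """Score every layer from the current state and keep the k best.

    Assumption: a single linear scorer per layer; the real router
    described in the abstract uses spatial-temporal information.
    """
    scores = [sum(w * s for w, s in zip(row, state)) for row in router_weights]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Run the selected layers in their original depth order.
    return sorted(chosen)


def forward_mixture_of_layers(x, layers, router_weights, k):
    """Apply only the routed layers; all other layers are skipped."""
    active = route_top_k(x, router_weights, k)
    for i in active:
        x = layers[i](x)
    return x, active


# Usage: 12 toy layers, of which only 6 are executed for this input.
dim, num_layers, k = 8, 12, 6
layers = [make_layer(dim, seed=i) for i in range(num_layers)]
rng = random.Random(0)
router_weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_layers)]
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
out, active = forward_mixture_of_layers(x, layers, router_weights, k)
print("active layers:", active)
```

Skipped layers cost nothing at inference time, which is where the compute saving in a scheme like this would come from. Hard top-k selection is used here only because it is easy to read; it is not differentiable, so a trainable router would need a different selection mechanism.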

arXiv information

Authors	Rongyu Zhang, Menghang Dong, Yuan Zhang, Liang Heng, Xiaowei Chi, Gaole Dai, Li Du, Dan Wang, Yuan Du, Shanghang Zhang
Published	2025-03-26 10:05:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
