DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Summary

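End-to-end autonomous driving (E2E-AD) requires effective processing of multi-view sensory data and robust handling of diverse, complex driving scenarios, particularly rare maneuvers such as aggressive turns.
The recent success of Mixture-of-Experts (MoE) architectures in large language models (LLMs) shows that parameter specialization enables strong scalability.
In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE.
DriveMoE is built on Drive-$\pi_0$, our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field).
Specifically, we add the Vision MoE to Drive-$\pi_0$ by training a router that dynamically selects relevant cameras according to the driving context.
This design mirrors human driving cognition, in which drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information.
In addition, we add the Action MoE by training another router that activates specialized expert modules for different driving behaviors.
Through explicit behavioral specialization, DriveMoE can handle diverse scenarios without suffering from the mode averaging seen in existing models.
In closed-loop evaluation experiments on Bench2Drive, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining the Vision and Action MoE in autonomous driving tasks.
We will release the code and models for DriveMoE and Drive-$\pi_0$.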
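
The two routing steps described in the summary (choosing relevant cameras, then activating a skill expert) can be illustrated with a generic top-k MoE gate. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the camera names, the router logits, the number of experts, and the helper names `top_k_route` and `mix_expert_outputs` are all hypothetical, and in the real model the logits would come from learned router networks rather than hard-coded values.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(scores: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (index, weight) for the k highest-scoring experts.

    Weights are the softmax probabilities renormalized over the selected
    experts, so they sum to 1.
    """
    probs = softmax(scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def mix_expert_outputs(
    routes: Sequence[Tuple[int, float]],
    expert_outputs: Sequence[Sequence[float]],
) -> List[float]:
    """Weighted sum of the outputs of the selected experts only."""
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for idx, weight in routes:
        for d in range(dim):
            mixed[d] += weight * expert_outputs[idx][d]
    return mixed


# Vision side: hypothetical camera names and hard-coded router logits.
CAMERAS = ["front", "front_left", "front_right", "back", "back_left", "back_right"]
camera_logits = [2.0, 1.5, -0.5, -1.0, 0.3, -0.7]
selected = top_k_route(camera_logits, k=2)
selected_cameras = [CAMERAS[i] for i, _ in selected]

# Action side: three hypothetical skill experts, each proposing (steer, throttle).
skill_logits = [0.1, 3.0, -1.0]
expert_actions = [[0.0, 0.5], [0.8, 0.2], [-0.8, 0.2]]
action = mix_expert_outputs(top_k_route(skill_logits, k=1), expert_actions)

print(selected_cameras)
print(action)
```

With `k=1` on the action side, only one expert's proposal is used, which is one simple way to avoid blending a left turn and a right turn into a straight line (the mode-averaging problem mentioned above). Whether DriveMoE uses hard top-1 selection or a softer mixture is not stated in this summary.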

Abstract (original)

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE to Drive-$\pi_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$\pi_0$.

arXiv information

Authors	Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan
Published	2025-05-22 06:23:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
