Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

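Abstract

Recent multimodal large language models (MLLMs) have achieved remarkable performance, but they face deployment challenges because of their quadratic computational complexity, growing key-value cache requirements, and reliance on separate vision encoders.
We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs, using moderate academic computational resources.
Our approach makes it possible to convert a trained decoder-only MLLM directly into a linear-complexity architecture, without requiring a pre-trained RNN-based LLM or a vision encoder.
We propose a seeding strategy for carving Mamba out of a trained Transformer, together with a three-stage distillation recipe, which effectively transfers knowledge from the Transformer to Mamba while preserving multimodal capabilities.
The method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs.
Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear- and quadratic-complexity VLMs, while mmMamba-hybrid improves performance significantly further and approaches HoVLE's capabilities.
At 103K tokens, mmMamba-linear shows a 20.6$\times$ speedup and a 75.8% reduction in GPU memory compared with HoVLE, while mmMamba-hybrid achieves a 13.5$\times$ speedup and 60.2% memory savings.
Code and models are released at https://github.com/hustvl/mmMamba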

Abstract (original)

Recent Multimodal Large Language Models (MLLMs) have achieved remarkable performance but face deployment challenges due to their quadratic computational complexity, growing Key-Value cache requirements, and reliance on separate vision encoders. We propose mmMamba, a framework for developing linear-complexity native multimodal state space models through progressive distillation from existing MLLMs using moderate academic computational resources. Our approach enables the direct conversion of trained decoder-only MLLMs to linear-complexity architectures without requiring pre-trained RNN-based LLM or vision encoders. We propose a seeding strategy to carve Mamba from trained Transformer and a three-stage distillation recipe, which can effectively transfer the knowledge from Transformer to Mamba while preserving multimodal capabilities. Our method also supports flexible hybrid architectures that combine Transformer and Mamba layers for customizable efficiency-performance trade-offs. Distilled from the Transformer-based decoder-only HoVLE, mmMamba-linear achieves competitive performance against existing linear and quadratic-complexity VLMs, while mmMamba-hybrid further improves performance significantly, approaching HoVLE’s capabilities. At 103K tokens, mmMamba-linear demonstrates 20.6$\times$ speedup and 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves 13.5$\times$ speedup and 60.2% memory savings. Code and models are released at https://github.com/hustvl/mmMamba
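
The memory argument in the abstract can be illustrated with a rough back-of-the-envelope sketch. A Transformer decoder caches keys and values for every token, so its cache grows linearly with sequence length, whereas a state space model carries a fixed-size recurrent state. The functions and all model dimensions below are hypothetical, chosen only for illustration; they are not taken from HoVLE or mmMamba, and the sketch ignores weights and activations, so it is not expected to reproduce the figures reported above.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence in a Transformer decoder."""
    # Two tensors (K and V) per layer, each of shape [seq_len, n_kv_heads, head_dim].
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def recurrent_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """Bytes for a fixed-size recurrent state; note that seq_len is not an argument."""
    # One [head_dim, state_dim] state matrix per head per layer (an assumed layout).
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem


# Hypothetical dimensions for illustration only (not HoVLE's or mmMamba's).
LAYERS, HEADS, HEAD_DIM, STATE_DIM = 32, 32, 128, 128

for tokens in (1_000, 10_000, 103_000):
    kv = kv_cache_bytes(tokens, LAYERS, HEADS, HEAD_DIM)
    state = recurrent_state_bytes(LAYERS, HEADS, HEAD_DIM, STATE_DIM)
    print(f"{tokens:>7} tokens: KV cache {kv / 2**30:7.2f} GiB, "
          f"recurrent state {state / 2**30:5.3f} GiB")
```

Under these assumed dimensions the cache size scales in proportion to the token count while the recurrent state stays constant, which is the qualitative effect behind the memory savings the abstract describes. A hybrid model that keeps some Transformer layers would sit between the two, paying the cache cost only for the layers it retains.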

arXiv information

Authors	Bencheng Liao, Hongyuan Tao, Qian Zhang, Tianheng Cheng, Yingyue Li, Haoran Yin, Wenyu Liu, Xinggang Wang
Published	2025-02-18 18:59:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
