MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling

Summary

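Advances in RNN models with linear complexity may make it possible to overcome the quadratic complexity of transformers.
In particular, the recently introduced Mamba-2 has demonstrated competitive performance, narrowing the gap between RNN models and transformers.
However, because of sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, which limits their contextual understanding.
This leads to slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks.
In this work, we present MaTVLM, a hybrid model obtained by replacing a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers.
Leveraging the inherent relationship between attention and Mamba-2, we initialize the Mamba-2 layers with the corresponding attention weights to accelerate convergence.
We then employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to MaTVLM, which further improves convergence speed and performance.
We also investigate the impact of differential distillation loss within our training framework.
We evaluate MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scale.
Remarkably, MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance.
Code and models are released at http://github.com/hustvl/MaTVLM.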
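
The idea of substituting a portion of the decoder layers can be pictured with a small sketch. The summary does not say which layers are replaced or how many, so the even spacing, the function name hybrid_layer_plan, and the mamba_fraction parameter below are all illustrative assumptions, not the authors' actual scheme.

```python
from __future__ import annotations


def hybrid_layer_plan(num_layers: int, mamba_fraction: float) -> list[str]:
    """Label each decoder layer as 'attention' or 'mamba2'.

    Assumption (not from the paper summary): the replaced layers are
    spread evenly through the stack.
    """
    if num_layers < 0:
        raise ValueError("num_layers must be non-negative")
    if not 0.0 <= mamba_fraction <= 1.0:
        raise ValueError("mamba_fraction must be in [0, 1]")

    num_mamba = round(num_layers * mamba_fraction)
    plan = ["attention"] * num_layers
    if num_mamba == 0:
        return plan

    # Place one Mamba-2 layer at the centre of each equal-width segment.
    stride = num_layers / num_mamba
    for k in range(num_mamba):
        idx = int(k * stride + stride / 2)
        plan[idx] = "mamba2"
    return plan


if __name__ == "__main__":
    # Hypothetical 8-layer decoder with a quarter of the layers replaced.
    print(hybrid_layer_plan(8, 0.25))
```

In a real model each "mamba2" slot would hold a Mamba-2 block initialized from the attention weights of the layer it replaces; that step depends on the released code and is not sketched here.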

Summary (original)

With the advancement of RNN models with linear complexity, the quadratic complexity challenge of transformers has the potential to be overcome. Notably, the emerging Mamba-2 has demonstrated competitive performance, bridging the gap between RNN models and transformers. However, due to sequential processing and vanishing gradients, RNN models struggle to capture long-range dependencies, limiting contextual understanding. This results in slow convergence, high resource demands, and poor performance on downstream understanding and complex reasoning tasks. In this work, we present a hybrid model MaTVLM by substituting a portion of the transformer decoder layers in a pre-trained VLM with Mamba-2 layers. Leveraging the inherent relationship between attention and Mamba-2, we initialize Mamba-2 with corresponding attention weights to accelerate convergence. Subsequently, we employ a single-stage distillation process, using the pre-trained VLM as the teacher model to transfer knowledge to the MaTVLM, further enhancing convergence speed and performance. Furthermore, we investigate the impact of differential distillation loss within our training framework. We evaluate the MaTVLM on multiple benchmarks, demonstrating competitive performance against the teacher model and existing VLMs while surpassing both Mamba-based VLMs and models of comparable parameter scales. Remarkably, the MaTVLM achieves up to 3.6x faster inference than the teacher model while reducing GPU memory consumption by 27.5%, all without compromising performance. Code and models are released at http://github.com/hustvl/MaTVLM.
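
The abstract mentions a single-stage distillation process with a pre-trained teacher but does not give the loss. As a rough illustration only, the sketch below shows a generic knowledge-distillation objective for one token position: a KL term between teacher and student distributions blended with a cross-entropy term on the target. The weighting alpha, the temperature, and all function names are assumptions and should not be read as the loss used for MaTVLM.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=1.0):
    """Blend a soft-label KL term with hard-label cross-entropy.

    alpha and temperature are hypothetical hyperparameters chosen for
    illustration; the abstract does not specify them.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    kd = kl_divergence(teacher_probs, student_soft) * temperature ** 2

    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[target_index] + 1e-12)
    return alpha * kd + (1.0 - alpha) * ce


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary.
    print(distillation_loss([1.0, 0.5, -1.0], [2.0, 0.0, -2.0], target_index=0))
```

When the student reproduces the teacher's logits exactly, the KL term vanishes and only the cross-entropy term remains, which is the behaviour a distillation objective of this kind is expected to have.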

arXiv information

Authors	Yingyue Li, Bencheng Liao, Wenyu Liu, Xinggang Wang
Published	2025-03-17 17:59:01+00:00

Source, services used

arxiv.jp, Google
