Demystify Mamba in Vision: A Linear Attention Perspective

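Summary

Mamba is an effective state space model with linear computational complexity.
It has recently shown impressive efficiency in handling high-resolution inputs across a range of vision tasks.
This paper shows that the powerful Mamba model shares surprising similarities with linear attention Transformers, which typically underperform conventional Transformers in practice.
By examining the similarities and differences between the effective Mamba and the weaker linear attention Transformer, the authors provide a comprehensive analysis of the key factors behind Mamba's success.
Specifically, they reformulate the selective state space model and linear attention within a unified formulation, and recast Mamba as a variant of the linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design.
For each design, they carefully analyze its pros and cons and empirically evaluate its impact on model performance in vision tasks.
Interestingly, the results highlight the forget gate and the block design as the core contributors to Mamba's success, while the other four designs are less crucial.
Based on these findings, they propose a Mamba-Inspired Linear Attention (MILA) model that incorporates the merits of these two key designs into linear attention.
The resulting model outperforms various vision Mamba models on both image classification and high-resolution dense prediction tasks, while retaining parallelizable computation and fast inference speed.
Code is available at https://github.com/LeapLabTHU/MLLA.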

Abstract (original)

Mamba is an effective state space model with linear computation complexity. It has recently shown impressive efficiency in dealing with high-resolution inputs across various vision tasks. In this paper, we reveal that the powerful Mamba model shares surprising similarities with linear attention Transformer, which typically underperform conventional Transformer in practice. By exploring the similarities and disparities between the effective Mamba and subpar linear attention Transformer, we provide comprehensive analyses to demystify the key factors behind Mamba’s success. Specifically, we reformulate the selective state space model and linear attention within a unified formulation, rephrasing Mamba as a variant of linear attention Transformer with six major distinctions: input gate, forget gate, shortcut, no attention normalization, single-head, and modified block design. For each design, we meticulously analyze its pros and cons, and empirically evaluate its impact on model performance in vision tasks. Interestingly, the results highlight the forget gate and block design as the core contributors to Mamba’s success, while the other four designs are less crucial. Based on these findings, we propose a Mamba-Inspired Linear Attention (MILA) model by incorporating the merits of these two key designs into linear attention. The resulting model outperforms various vision Mamba models in both image classification and high-resolution dense prediction tasks, while enjoying parallelizable computation and fast inference speed. Code is available at https://github.com/LeapLabTHU/MLLA.
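
To make the "unified formulation" idea concrete, the following is a minimal sketch of causal single-head linear attention written as a recurrence, with an optional forget gate and optional attention normalization. It is an illustration under stated assumptions, not the authors' implementation: the function name `linear_attention_recurrent` and the parameters `forget`, `normalize`, and `eps` are hypothetical, the inputs are assumed to be non-negative feature maps, and how the forget values are produced is left out because the abstract does not specify it. The linked repository is the reference for the actual model.

```python
import numpy as np


def linear_attention_recurrent(q, k, v, forget=None, normalize=True, eps=1e-6):
    """Causal single-head linear attention in recurrent form (illustrative sketch).

    q, k: arrays of shape (n, d), assumed non-negative feature maps.
    v: array of shape (n, d_v).
    forget: optional array of shape (n, d) with decay values in [0, 1];
        this is a hypothetical stand-in for a forget gate.
    normalize: if True, divide by the running normalizer (plain linear
        attention); if False, skip attention normalization.
    """
    n, d = k.shape
    state = np.zeros((d, v.shape[1]))  # running sum of outer(k_i, v_i)
    norm = np.zeros(d)                 # running sum of k_i
    out = np.zeros((n, v.shape[1]))
    for i in range(n):
        if forget is not None:
            # Forget gate: decay the previous state before adding new content.
            # Decaying the normalizer as well is a choice made for this sketch.
            state = forget[i][:, None] * state
            norm = forget[i] * norm
        state = state + np.outer(k[i], v[i])
        norm = norm + k[i]
        y = q[i] @ state
        if normalize:
            y = y / (q[i] @ norm + eps)
        out[i] = y
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k = rng.random((6, 4)), rng.random((6, 4))
    v = rng.random((6, 3))
    gate = np.full((6, 4), 0.9)  # constant decay, purely for illustration
    print(linear_attention_recurrent(q, k, v).shape)
    print(linear_attention_recurrent(q, k, v, forget=gate, normalize=False).shape)
```

With `forget=None` and `normalize=True` this is ordinary causal linear attention; passing a `forget` array and setting `normalize=False` moves the recurrence toward the gated, unnormalized form that the abstract associates with Mamba. The cost per token is constant in the sequence length, which is where the linear complexity comes from.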

arXiv information

Authors	Dongchen Han, Ziyi Wang, Zhuofan Xia, Yizeng Han, Yifan Pu, Chunjiang Ge, Jun Song, Shiji Song, Bo Zheng, Gao Huang
Published	2024-12-02 08:41:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
