Revisiting the Integration of Convolution and Attention for Vision Backbone

Summary

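Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically regarded as alternatives to each other for building vision backbones.
Some works do integrate the two, but they apply both operators simultaneously at the finest pixel granularity.
Since Convs already handle per-pixel feature extraction, the question is whether heavy MHSAs are still needed at such a fine-grained level.
In fact, this is the root cause of the scalability issue of vision transformers with respect to input resolution.
To address this problem, this work proposes using MHSAs and Convs in parallel \textbf{at different granularity levels} instead.
Specifically, each layer represents the image in two ways: a fine-grained regular grid and a coarse-grained set of semantic slots.
Different operations are applied to the two representations: Convs to the grid for local features, and MHSAs to the slots for global features.
A pair of fully differentiable soft clustering and dispatching modules bridges the grid and set representations, enabling local-global fusion.
Extensive experiments on various vision tasks empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to lightweight Convs, it suffices to apply MHSAs to a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones while being more efficient.
The visualization results also show that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may bring better interpretability and inspire new weakly-supervised semantic segmentation approaches.
Code will be available at \url{https://github.com/rayleizhu/GLMix}.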
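
The two-branch layer described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the similarity used for clustering (dot product followed by a softmax over slots), the single-head attention without learned projections, the fixed 3x3 mean filter standing in for a learned Conv, the additive fusion, and every function name here are hypothetical choices made for readability. The toy 4x4 grid, the 2 slots, and the initial slot centers are likewise made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_cluster(tokens, centers):
    """Softly assign each grid token to every slot, then pool tokens into slots."""
    # assign[i][k] is the weight of token i for slot k; each row sums to 1.
    assign = [softmax([dot(t, c) for c in centers]) for t in tokens]
    dim = len(tokens[0])
    slots = []
    for k in range(len(centers)):
        mass = sum(a[k] for a in assign)  # always > 0 because softmax is positive
        slots.append([sum(a[k] * t[d] for a, t in zip(assign, tokens)) / mass
                      for d in range(dim)])
    return assign, slots


def self_attention(slots):
    """Single-head self-attention over the slots (assumption: no learned projections)."""
    scale = 1.0 / math.sqrt(len(slots[0]))
    out = []
    for q in slots:
        weights = softmax([dot(q, k) * scale for k in slots])
        out.append([sum(w * v[d] for w, v in zip(weights, slots))
                    for d in range(len(q))])
    return out


def dispatch(assign, slots):
    """Send slot features back to the grid using the same soft assignment."""
    dim = len(slots[0])
    return [[sum(a[k] * slots[k][d] for k in range(len(slots)))
             for d in range(dim)] for a in assign]


def local_mean_filter(tokens, height, width):
    """3x3 mean filter per channel (assumption: a fixed stand-in for a learned Conv)."""
    dim = len(tokens[0])
    out = []
    for r in range(height):
        for c in range(width):
            nbrs = [tokens[rr * width + cc]
                    for rr in range(max(0, r - 1), min(height, r + 2))
                    for cc in range(max(0, c - 1), min(width, c + 2))]
            out.append([sum(n[d] for n in nbrs) / len(nbrs) for d in range(dim)])
    return out


def glmix_like_block(tokens, height, width, centers):
    """Fuse a local (grid) branch and a global (slot) branch by addition."""
    local = local_mean_filter(tokens, height, width)
    assign, slots = soft_cluster(tokens, centers)
    global_feat = dispatch(assign, self_attention(slots))
    fused = [[l + g for l, g in zip(lrow, grow)]
             for lrow, grow in zip(local, global_feat)]
    return fused, assign


def similarity_terms(num_tokens, num_slots):
    """Count weight terms: pixel-level attention vs. cluster + slot attention + dispatch."""
    dense = num_tokens * num_tokens
    slot_route = 2 * num_tokens * num_slots + num_slots * num_slots
    return dense, slot_route


# Toy 4x4 grid (row-major) with 2-channel features and 2 slots; all values are made up.
H, W = 4, 4
tokens = [[float(r), float(c)] for r in range(H) for c in range(W)]
centers = [tokens[0], tokens[-1]]  # hypothetical initial slot centers
fused, assign = glmix_like_block(tokens, H, W, centers)
print(len(fused), len(fused[0]))
```

The fused output keeps the grid shape (one feature vector per pixel), while attention only ever runs over the slots. `similarity_terms` gives a rough count of weight terms rather than the FLOPs of any real model: for a hypothetical 56x56 grid and 64 slots, pixel-level attention needs 3136 x 3136 terms, whereas the slot route needs 2 x 3136 x 64 + 64 x 64, and only the first grows quadratically with resolution.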

Abstract (original)

Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. With Convs responsible for per-pixel feature extraction already, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.

arXiv information

Authors	Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
Published	2024-11-21 18:59:08+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
