Demystify Transformers & Convolutions in Modern Image Deep Networks

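Summary

The recent success of vision transformers has inspired a series of vision backbones built on novel feature transformation paradigms, each reporting steady performance gains.
Although these novel feature transformation designs are often claimed as the source of the gain, some backbones may also benefit from advanced engineering techniques, which makes it hard to identify the real gain contributed by the key feature transformation operators.
This paper aims to identify the real gain of popular convolution and attention operators and to study them in depth.
We observe that the main difference among these feature transformation modules, such as attention and convolution, lies in how they aggregate spatial features, that is, in the so-called "spatial token mixer" (STM).
We therefore first build a unified architecture that eliminates the unfair impact of differing engineering techniques, and then fit each STM into this architecture for comparison.
Based on various experiments on upstream and downstream tasks and an analysis of inductive bias, we find that engineering techniques boost performance significantly, yet a performance gap still remains among different STMs.
The detailed analysis also reveals interesting findings about the different STMs, for example in their effective receptive fields and in invariance tests.
The code and trained models will be publicly available at https://github.com/OpenGVLab/STM-Evaluation.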
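
The idea of holding the architecture fixed and swapping only the spatial token mixer can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that). All names here (local_mixer, global_attention_mixer, unified_block) are hypothetical, the mixers use fixed rather than learned weights, and tokens are laid out in 1D instead of on a 2D image grid, purely to keep the example short.

```python
import math
from typing import Callable, List

Tokens = List[List[float]]          # N tokens, each a C-dimensional feature vector
Mixer = Callable[[Tokens], Tokens]  # an STM maps N tokens to N tokens


def local_mixer(x: Tokens, window: int = 3) -> Tokens:
    """Convolution-like STM (assumption: uniform, unlearned weights).

    Each token becomes the average of the tokens inside a fixed local window.
    """
    n, half = len(x), window // 2
    out = []
    for i in range(n):
        neigh = x[max(0, i - half):min(n, i + half + 1)]
        out.append([sum(col) / len(neigh) for col in zip(*neigh)])
    return out


def global_attention_mixer(x: Tokens) -> Tokens:
    """Attention-like STM (assumption: no learned query/key/value projections).

    Each token becomes a softmax-weighted sum over all tokens.
    """
    scale = 1.0 / math.sqrt(len(x[0]))
    out = []
    for q in x:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in x]
        top = max(scores)                        # subtract max for stability
        w = [math.exp(s - top) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, x)) / z
                    for c in range(len(q))])
    return out


def unified_block(x: Tokens, mixer: Mixer) -> Tokens:
    """Shared block: residual spatial mixing, then a residual per-token step."""
    mixed = mixer(x)
    x = [[a + b for a, b in zip(t, m)] for t, m in zip(x, mixed)]
    # Placeholder channel step: a fixed ReLU standing in for a learned MLP.
    return [[a + max(0.0, a) for a in t] for t in x]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for mixer in (local_mixer, global_attention_mixer):
        print(mixer.__name__, unified_block(tokens, mixer))
```

Because everything outside the mixer argument is identical across the two calls, any difference in output comes from the mixer alone, which is the kind of controlled comparison the summary describes.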

Abstract (original)

Recent success of vision transformers has inspired a series of vision backbones with novel feature transformation paradigms, which report steady performance gain. Although the novel feature transformation designs are often claimed as the source of gain, some backbones may benefit from advanced engineering techniques, which makes it hard to identify the real gain from the key feature transformation operators. In this paper, we aim to identify real gain of popular convolution and attention operators and make an in-depth study of them. We observe that the main difference among these feature transformation modules, e.g., attention or convolution, lies in the way of spatial feature aggregation, or the so-called ‘spatial token mixer’ (STM). Hence, we first elaborate a unified architecture to eliminate the unfair impact of different engineering techniques, and then fit STMs into this architecture for comparison. Based on various experiments on upstream/downstream tasks and the analysis of inductive bias, we find that the engineering techniques boost the performance significantly, but the performance gap still exists among different STMs. The detailed analysis also reveals some interesting findings of different STMs, such as effective receptive fields and invariance tests. The code and trained models will be publicly available at https://github.com/OpenGVLab/STM-Evaluation

arXiv information

Authors	Jifeng Dai,Min Shi,Weiyun Wang,Sitong Wu,Linjie Xing,Wenhai Wang,Xizhou Zhu,Lewei Lu,Jie Zhou,Xiaogang Wang,Yu Qiao,Xiaowei Hu
Published	2022-11-10 18:59:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
