Interpret Vision Transformers as ConvNets with Dynamic Convolutions

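Summary

There has been ongoing debate over whether vision Transformers or ConvNets make the better backbone for computer vision models.
The two are usually regarded as completely different architectures, but this paper interprets vision Transformers as ConvNets with dynamic convolutions. This makes it possible to characterize existing Transformers and dynamic ConvNets within a unified framework and to compare their design choices side by side.
The interpretation can also guide network design, because researchers can now consider vision Transformers from the design space of ConvNets, and vice versa.
The authors demonstrate this potential through two specific studies.
First, they examine the role of softmax as the activation function in vision Transformers and find that it can be replaced by commonly used ConvNet modules such as ReLU and Layer Normalization, which results in faster convergence and better performance.
Second, following the design of depth-wise convolution, they create a corresponding depth-wise vision Transformer that is more efficient with comparable performance.
The potential of the proposed unified interpretation is not limited to these examples, and the authors hope it will inspire the community and give rise to more advanced network architectures.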

Abstract (original)

There has been a debate about the superiority between vision Transformers and ConvNets, serving as the backbone of computer vision models. Although they are usually considered as two completely different architectures, in this paper, we interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework and compare their design choices side by side. In addition, our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets and vice versa. We demonstrate such potential through two specific studies. First, we inspect the role of softmax in vision Transformers as the activation function and find it can be replaced by commonly used ConvNets modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better performance. Second, following the design of depth-wise convolution, we create a corresponding depth-wise vision Transformer that is more efficient with comparable performance. The potential of the proposed unified interpretation is not limited to the given examples and we hope it can inspire the community and give rise to more advanced network architectures.
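
As a rough illustration of the interpretation described in the abstract, the sketch below contrasts a static 1D convolution, whose kernel is fixed, with single-head self-attention viewed as a convolution whose kernel spans all tokens and is regenerated from the input at every position. It then swaps softmax for ReLU and applies a layer normalization to the output. This is a pure-Python toy, not the paper's implementation: the identity query/key/value projections, the placement of the normalization, and all function names are assumptions made for illustration, since the abstract does not give these details.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu(scores):
    return [max(0.0, s) for s in scores]


def layer_norm(vec, eps=1e-5):
    # Normalize one token's features to zero mean and unit variance
    # (no learned scale or shift in this toy version).
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]


def static_conv1d(tokens, kernel):
    """Static convolution: the same fixed kernel at every position (zero padding)."""
    n, radius, dim = len(tokens), len(kernel) // 2, len(tokens[0])
    out = []
    for i in range(n):
        acc = [0.0] * dim
        for offset, w in enumerate(kernel):
            j = i + offset - radius
            if 0 <= j < n:
                acc = [a + w * x for a, x in zip(acc, tokens[j])]
        out.append(acc)
    return out


def dynamic_kernels(queries, keys, activation):
    """One kernel per output position, generated from the input itself.

    Each kernel covers every token, i.e. the kernel size is global.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    return [activation([scale * dot(q, k) for k in keys]) for q in queries]


def apply_kernels(kernels, values):
    """Weighted sum of value tokens under each position's kernel."""
    dim = len(values[0])
    return [
        [sum(w * v[c] for w, v in zip(kernel, values)) for c in range(dim)]
        for kernel in kernels
    ]


# Three toy tokens with two features each.
# Assumption: identity projections, so queries = keys = values = tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Standard attention: softmax turns the scores into the dynamic kernel.
softmax_kernels = dynamic_kernels(tokens, tokens, softmax)
softmax_out = apply_kernels(softmax_kernels, tokens)

# Assumed variant: ReLU produces the kernel, then the output is layer-normalized.
relu_kernels = dynamic_kernels(tokens, tokens, relu)
relu_out = [layer_norm(row) for row in apply_kernels(relu_kernels, tokens)]
```

In `static_conv1d` the weights are the same for every input and every position, whereas `dynamic_kernels` returns a different weight vector per position that changes whenever the tokens change. Passing `relu` instead of `softmax` only changes how scores become kernel weights, which is why the activation can be treated as a swappable design choice in this view.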

arXiv information

Authors	Chong Zhou, Chen Change Loy, Bo Dai
Published	2023-09-19 16:00:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
