Degenerate Swin to Win: Plain Window-based Transformer without Sophisticated Operations

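Abstract

The remarkable success of Transformers in natural language processing has motivated computer vision researchers to build Vision Transformers.
Compared with convolutional neural networks (CNNs), a Vision Transformer has a larger receptive field that can capture long-range dependencies.
However, this large receptive field comes at a huge computational cost.
Window-based Vision Transformers emerged to improve efficiency.
They partition an image into several local windows and perform self-attention within each window.
To recover a global receptive field, window-based Vision Transformers have put considerable effort into cross-window communication, developing several sophisticated operations for it.
In this work, we examine whether shifted window partitioning, the key design element of Swin Transformer, is actually necessary.
We find that a simple depthwise convolution is sufficient for effective cross-window communication.
Specifically, once a depthwise convolution is present, the shifted window configuration of Swin Transformer yields no additional performance improvement.
We therefore degenerate Swin Transformer into a plain Window-based (Win) Transformer by discarding the sophisticated shifted window partitioning.
The proposed Win Transformer is conceptually simpler and easier to implement than Swin Transformer.
At the same time, our Win Transformer consistently outperforms Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.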
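
The two ingredients named in the abstract, self-attention restricted to local windows and a depthwise convolution that lets information cross window borders, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single attention head with identity query/key/value projections, the 3x3 kernel size, and applying the convolution after the attention step are all assumptions made for illustration.

```python
import numpy as np


def window_partition(x, ws):
    """Split an (H, W, C) feature map into non-overlapping ws x ws windows.

    Returns an array of shape (num_windows, ws * ws, C).
    """
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "H and W must be multiples of ws"
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, ws, ws, C)
    return x.reshape(-1, ws * ws, C)


def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)


def window_self_attention(x, ws):
    """Toy single-head self-attention computed independently in each window.

    Assumption: queries, keys and values are the input itself
    (no learned projections), to keep the sketch short.
    """
    H, W, C = x.shape
    win = window_partition(x, ws)                        # (N, T, C)
    scores = win @ win.transpose(0, 2, 1) / np.sqrt(C)   # (N, T, T)
    scores -= scores.max(axis=-1, keepdims=True)         # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return window_reverse(attn @ win, ws, H, W)


def depthwise_conv3x3(x, kernel):
    """Depthwise 3x3 convolution with zero padding.

    x has shape (H, W, C); kernel has shape (3, 3, C), one filter per channel.
    """
    H, W, C = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + H, dx:dx + W, :] * kernel[dy, dx, :]
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 8, 4))
    x_perturbed = x.copy()
    x_perturbed[3, 3] += 1.0  # change one pixel inside the top-left window

    a = window_self_attention(x, ws=4)
    b = window_self_attention(x_perturbed, ws=4)
    # Attention alone: rows 4..7 belong to other windows and are unaffected.
    print("other windows changed (attention only):", not np.allclose(a[4:], b[4:]))

    k = np.full((3, 3, 4), 1.0 / 9.0)  # hypothetical box-filter kernel
    # With the depthwise convolution, the change reaches row 4 across the border.
    print("neighbor row changed (with depthwise conv):",
          not np.allclose(depthwise_conv3x3(a, k)[4], depthwise_conv3x3(b, k)[4]))
```

In this toy setting a change inside one window never reaches the others through window attention alone, while the depthwise convolution spreads it one pixel across each window border per layer. Stacking such layers would widen the reach step by step, which is the intuition behind replacing shifted windows with a plain convolution.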

Abstract (original)

The formidable accomplishment of Transformers in natural language processing has motivated the researchers in the computer vision community to build Vision Transformers. Compared with the Convolution Neural Networks (CNN), a Vision Transformer has a larger receptive field which is capable of characterizing the long-range dependencies. Nevertheless, the large receptive field of Vision Transformer is accompanied by the huge computational cost. To boost efficiency, the window-based Vision Transformers emerge. They crop an image into several local windows, and the self-attention is conducted within each window. To bring back the global receptive field, window-based Vision Transformers have devoted a lot of efforts to achieving cross-window communications by developing several sophisticated operations. In this work, we check the necessity of the key design element of Swin Transformer, the shifted window partitioning. We discover that a simple depthwise convolution is sufficient for achieving effective cross-window communications. Specifically, with the existence of the depthwise convolution, the shifted window configuration in Swin Transformer cannot lead to an additional performance improvement. Thus, we degenerate the Swin Transformer to a plain Window-based (Win) Transformer by discarding sophisticated shifted window partitioning. The proposed Win Transformer is conceptually simpler and easier for implementation than Swin Transformer. Meanwhile, our Win Transformer achieves consistently superior performance than Swin Transformer on multiple computer vision tasks, including image recognition, semantic segmentation, and object detection.

arXiv information

Authors	Tan Yu, Ping Li
Published	2022-11-25 17:36:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
