Improving Vision Transformers by Revisiting High-frequency Components

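Abstract

Transformer models have shown promising effectiveness on a variety of vision tasks.
However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on large-scale training sets.
To explain this observation, we hypothesize that ViT models are less effective than CNN models at capturing the high-frequency components of images, and we verify this hypothesis with a frequency analysis.
Inspired by this finding, we first examine existing techniques for improving ViT models from a new frequency perspective, and find that the success of some of them (e.g., RandAugment) can be attributed to better use of the high-frequency components.
Then, to compensate for this insufficient ability of ViT models, we propose HAT, which directly augments the high-frequency components of images via adversarial training.
We show that HAT consistently boosts the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and in particular raises the advanced model VOLO-D5 to 87.3% using only ImageNet-1K data.
The advantage is also maintained on out-of-distribution data and transfers to downstream tasks.
The code is available at https://github.com/jiawangbai/HAT.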
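
To make the phrase "high-frequency components" concrete, the following is a minimal sketch that splits a grayscale image into low- and high-frequency parts with a circular mask in the 2-D Fourier domain. It is not the HAT method and not the frequency analysis protocol used by the authors; the function name `split_frequencies` and the cutoff parameter `radius` are assumptions introduced only for this illustration.

```python
import numpy as np


def split_frequencies(image, radius):
    """Split a 2-D grayscale image into low- and high-frequency parts.

    `radius` is a hypothetical cutoff, measured in frequency bins from
    the centre of the shifted spectrum. Bins within the radius are
    treated as "low frequency", everything else as "high frequency".
    """
    h, w = image.shape
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Distance of every frequency bin from the spectrum centre.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius

    # Invert each masked spectrum back to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * ~low_mask)).real
    return low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((32, 32))          # stand-in for a real image
    low, high = split_frequencies(img, radius=4)
    # The two parts are complementary, so they sum back to the input.
    print(np.allclose(low + high, img))
```

Because the two masks are complementary, the low and high parts always add back up to the original image. Feeding a model only `low` or only `high` is one simple way to probe how much it depends on each band, under the assumptions stated above.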

Abstract (original)

The transformer models have shown promising effectiveness in dealing with various vision tasks. However, compared with training Convolutional Neural Network (CNN) models, training Vision Transformer (ViT) models is more difficult and relies on the large-scale training set. To explain this observation we make a hypothesis that \textit{ViT models are less effective in capturing the high-frequency components of images than CNN models}, and verify it by a frequency analysis. Inspired by this finding, we first investigate the effects of existing techniques for improving ViT models from a new frequency perspective, and find that the success of some techniques (e.g., RandAugment) can be attributed to the better usage of the high-frequency components. Then, to compensate for this insufficient ability of ViT models, we propose HAT, which directly augments high-frequency components of images via adversarial training. We show that HAT can consistently boost the performance of various ViT models (e.g., +1.2% for ViT-B, +0.5% for Swin-B), and especially enhance the advanced model VOLO-D5 to 87.3% that only uses ImageNet-1K data, and the superiority can also be maintained on out-of-distribution data and transferred to downstream tasks. The code is available at: https://github.com/jiawangbai/HAT.

arXiv information

Authors	Jiawang Bai, Li Yuan, Shu-Tao Xia, Shuicheng Yan, Zhifeng Li, Wei Liu
Published	2022-07-27 09:49:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
