What do Vision Transformers Learn? A Visual Exploration

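Summary

Vision transformers (ViTs) are rapidly becoming the de facto architecture for computer vision.
Existing studies visually analyze the mechanisms of convolutional neural networks (CNNs), but an analogous exploration of ViTs remains challenging.
The paper first addresses the obstacles to performing visualizations on ViTs.
With these solutions in hand, the authors observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features.
They also explore the underlying differences between ViTs and CNNs and find that transformers, like their convolutional counterparts, detect image background features, but that their predictions depend far less on high-frequency information.
On the other hand, both architecture types behave similarly in that features progress from abstract patterns in early layers to concrete objects in late layers.
In addition, ViTs are shown to maintain spatial information in all layers except the final one.
In contrast to previous works, the last layer is shown to most likely discard spatial information and to behave as a learned global pooling operation.
Finally, large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, are used to validate the effectiveness of the method.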

Abstract (original)

Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision, yet we understand very little about why they work and what they learn. While existing studies visually analyze the mechanisms of convolutional neural networks, an analogous exploration of ViTs remains challenging. In this paper, we first address the obstacles to performing visualizations on ViTs. Assisted by these solutions, we observe that neurons in ViTs trained with language model supervision (e.g., CLIP) are activated by semantic concepts rather than visual features. We also explore the underlying differences between ViTs and CNNs, and we find that transformers detect image background features, just like their convolutional counterparts, but their predictions depend far less on high-frequency information. On the other hand, both architecture types behave similarly in the way features progress from abstract patterns in early layers to concrete objects in late layers. In addition, we show that ViTs maintain spatial information in all layers except the final layer. In contrast to previous works, we show that the last layer most likely discards the spatial information and behaves as a learned global pooling operation. Finally, we conduct large-scale visualizations on a wide range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin, to validate the effectiveness of our method.
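
The claim about high-frequency information can be made concrete with a small sketch. The abstract does not say how the authors filter images, so the following is only an assumption-laden illustration: it uses a box blur as a crude low-pass filter and treats the residual as the high-frequency part. The function names `box_blur` and `split_frequencies` are hypothetical and do not come from the paper.

```python
def box_blur(image, radius=1):
    """Average each pixel with its neighbours within `radius`.

    `image` is a 2-D list of floats (grayscale). Windows are clipped at the
    borders. A box blur is a crude low-pass filter, chosen here only for
    illustration; it is not the filtering procedure used in the paper.
    """
    height, width = len(image), len(image[0])
    blurred = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            blurred[y][x] = total / count
    return blurred


def split_frequencies(image, radius=1):
    """Split an image into a low-frequency part and a high-frequency residual."""
    low = box_blur(image, radius)
    high = [
        [pixel - low_pixel for pixel, low_pixel in zip(row, low_row)]
        for row, low_row in zip(image, low)
    ]
    return low, high
```

Feeding only the `low` part to a classifier and comparing its accuracy with the accuracy on the unfiltered image would be one way to probe how much a model relies on fine detail. A flat image passes through unchanged, while a one-pixel checkerboard is almost entirely moved into the `high` part.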

arXiv information

Authors	Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein
Published	2022-12-13 16:55:12+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
