Teaching Matters: Investigating the Role of Supervision in Vision Transformers

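Abstract

Vision Transformers (ViTs) have become very popular in recent years and are now used in many applications.
However, how their behavior varies under different learning paradigms has not been well explored.
We compare ViTs trained with different methods of supervision and show that they learn a diverse range of behaviors in terms of attention, representations, and downstream performance.
We also discover ViT behaviors that are consistent across supervision methods, including the emergence of Offset Local Attention Heads.
These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that, to the best of our knowledge, has not been highlighted in prior work.
Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on the training method.
We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features and can even be superior for part-level tasks.
We also find that the representations of reconstruction-based models show non-trivial similarity to those of contrastive self-supervised models.
Finally, we show how the "best" layer for a given task varies with both the supervision method and the task, further demonstrating the differing order of information processing in ViTs.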

Abstract (original)

Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, it is not well explored how varied their behavior is under different learning paradigms. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. Finally, we show how the ‘best’ layer for a given task varies by both supervision method and task, further demonstrating the differing order of information processing in ViTs.
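
As an illustration of the Offset Local Attention Heads described in the abstract, the following is a minimal sketch of what an idealised head of this kind might look like on a grid of patch tokens. It is not the authors' code or analysis method. The function names `offset_local_attention` and `mean_attention_offset`, the grid size, the omission of the CLS token, and the choice to let border tokens attend to themselves are all assumptions made for this example.

```python
def offset_local_attention(grid_h, grid_w, dy, dx):
    """Idealised attention matrix for one hypothetical offset local head.

    Each patch token at (row, col) puts all of its attention on the patch
    at (row + dy, col + dx). Tokens whose target falls outside the grid
    attend to themselves (an assumption, so that every row sums to 1).
    The CLS token is left out for simplicity.
    """
    n = grid_h * grid_w
    attn = [[0.0] * n for _ in range(n)]
    for row in range(grid_h):
        for col in range(grid_w):
            tr, tc = row + dy, col + dx
            if not (0 <= tr < grid_h and 0 <= tc < grid_w):
                tr, tc = row, col  # border fallback (assumption)
            attn[row * grid_w + col][tr * grid_w + tc] = 1.0
    return attn


def mean_attention_offset(attn, grid_h, grid_w):
    """Attention-weighted mean displacement (dy, dx) from query to key."""
    n = grid_h * grid_w
    total_dy = total_dx = 0.0
    for q in range(n):
        qr, qc = divmod(q, grid_w)
        for k in range(n):
            kr, kc = divmod(k, grid_w)
            total_dy += attn[q][k] * (kr - qr)
            total_dx += attn[q][k] * (kc - qc)
    return total_dy / n, total_dx / n


# A head that always looks one patch to the right on a 3x3 grid.
attn = offset_local_attention(3, 3, dy=0, dx=1)
print(mean_attention_offset(attn, 3, 3))
```

A real head would have soft attention weights rather than a one-hot pattern, but a mean displacement that stays close to a fixed nonzero (dy, dx) across tokens is one simple way such a head could be recognised.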

arXiv information

Authors	Matthew Walmer, Saksham Suri, Kamal Gupta, Abhinav Shrivastava
Published	2022-12-07 18:59:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
