Robustifying Token Attention for Vision Transformers

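Summary

Vision transformers (ViTs) have been successful, but their accuracy still drops significantly in the presence of common corruptions such as noise or blur.
Interestingly, we observe that the attention mechanism of ViTs tends to rely on a small number of important tokens, a phenomenon we call token overfocusing.
More critically, these tokens are not robust to corruptions, which often causes the attention patterns to diverge strongly.
In this paper, we aim to alleviate this overfocusing problem and make attention more stable through two general techniques. First, the Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Specifically, TAP learns an average pooling scheme for each token so that information from potentially important neighboring tokens can be taken into account adaptively.
Second, we use the Attention Diversification Loss (ADL) to force output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
This is achieved by penalizing high cosine similarity between the attention vectors of different tokens.
In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly.
For example, building on the state-of-the-art robust architecture FAN, we improve corruption robustness on ImageNet-C by 2.4% while improving accuracy by 0.4%.
Also, when fine-tuning on semantic segmentation tasks, we improve robustness by 2.4% on CityScapes-C and by 3.0% on ACDC.
Our code is available at https://github.com/guoyongcs/TAPADL.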
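
The TAP idea in the summary (a per-token average pooling scheme over the local neighborhood) can be illustrated with a small sketch. This is not the authors' implementation: it assumes a 1-D token sequence instead of a 2-D patch grid, a fixed set of window radii, and per-token mixing logits that are simply passed in. The names `token_aware_average_pool`, `average_pool_1d`, `branch_logits`, and `radii` are hypothetical; in a real model the logits would presumably be predicted by a learned layer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_pool_1d(tokens, radius):
    """Average each token with its neighbors within `radius` positions.

    Windows are clipped at the sequence boundaries, so edge tokens
    average over fewer neighbors.
    """
    n, dim = len(tokens), len(tokens[0])
    pooled = []
    for i in range(n):
        window = tokens[max(0, i - radius):min(n, i + radius + 1)]
        pooled.append([sum(t[d] for t in window) / len(window) for d in range(dim)])
    return pooled


def token_aware_average_pool(tokens, branch_logits, radii=(0, 1, 2)):
    """Mix several average-pooling branches with per-token weights.

    ASSUMPTION: `branch_logits[i]` holds one logit per radius for token i.
    Here it is an input; a trained model would learn to predict it.
    """
    dim = len(tokens[0])
    branches = [average_pool_1d(tokens, r) for r in radii]
    output = []
    for i in range(len(tokens)):
        weights = softmax(branch_logits[i])
        output.append([
            sum(w * branch[i][d] for w, branch in zip(weights, branches))
            for d in range(dim)
        ])
    return output


tokens = [[0.0], [3.0], [6.0]]
uniform_logits = [[0.0, 0.0, 0.0]] * 3  # every pooling radius weighted equally
print(token_aware_average_pool(tokens, uniform_logits))
```

With a large logit on radius 0, a token keeps its own value; shifting weight toward larger radii pulls in more of its neighborhood, which is the adaptive behavior the summary describes.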

Summary (original)

Despite the success of vision transformers (ViTs), they still suffer from significant drops in accuracy in the presence of common corruptions, such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, often leading to highly diverging attention patterns. In this paper, we intend to alleviate this overfocusing issue and make attention more stable through two general techniques: First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns average pooling schemes for each token such that the information of potentially important tokens in the neighborhood can adaptively be taken into account. Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few by using our Attention Diversification Loss (ADL). We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, we improve corruption robustness on ImageNet-C by 2.4% while improving accuracy by 0.4% based on state-of-the-art robust architecture FAN. Also, when fine-tuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4% and ACDC by 3.0%. Our code is available at https://github.com/guoyongcs/TAPADL.
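
The abstract describes ADL as penalizing high cosine similarity between the attention vectors of different tokens. A minimal sketch of that idea follows, assuming the penalty is the mean pairwise cosine similarity between rows of a single attention matrix. The function name `attention_diversity_penalty` is hypothetical, and the paper's exact formulation (for example any thresholding, or averaging over heads and layers) may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def attention_diversity_penalty(attn):
    """Mean cosine similarity over all pairs of distinct attention rows.

    ASSUMPTION: `attn[i]` is the attention vector of output token i over
    the input tokens. A value near 1 means all output tokens attend to the
    same inputs (overfocusing); a value near 0 means diverse attention.
    """
    n = len(attn)
    if n < 2:
        return 0.0
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += cosine_similarity(attn[i], attn[j])
            pairs += 1
    return total / pairs


# Every output token attends mostly to input token 0: high penalty.
overfocused = [[0.9, 0.05, 0.05], [0.9, 0.05, 0.05], [0.9, 0.05, 0.05]]
# Each output token attends to a different input token: zero penalty.
diverse = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(attention_diversity_penalty(overfocused))
print(attention_diversity_penalty(diverse))
```

In training, a term like this would be added to the task loss with some weighting factor, so that minimizing it pushes different tokens toward different attention patterns.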

arXiv information

Authors	Yong Guo, David Stutz, Bernt Schiele
Published	2023-09-06 11:09:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
