DCT-Based Decorrelated Attention for Vision Transformers

Summary

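The effectiveness of Transformer architectures rests on the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space.
Training the attention weights for queries, keys, and values from a random initialization, however, is non-trivial.
This paper proposes two methods.
(i) First, it addresses the initialization problem of Vision Transformers with a simple yet highly innovative initialization approach that uses Discrete Cosine Transform (DCT) coefficients.
The proposed DCT-based attention initialization shows a significant gain over traditional initialization strategies and gives the attention mechanism a robust foundation.
In the authors' experiments, DCT-based initialization improves the accuracy of Vision Transformers on classification tasks.
(ii) Second, the authors observe that the DCT effectively decorrelates image information in the frequency domain, and that this decorrelation is useful for compression because it lets the quantization step discard many of the higher-frequency components.
Based on this observation, they propose a novel DCT-based compression technique for the attention function of Vision Transformers.
Because high-frequency DCT coefficients usually correspond to noise, the high-frequency DCT components of the input patches are truncated.
This DCT-based compression reduces the size of the weight matrices for queries, keys, and values.
The DCT-compressed Swin Transformers considerably reduce computational overhead while maintaining the same level of accuracy.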
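
The two ideas in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the DCT coefficients are assigned to the attention weights or along which axis the patches are truncated. The sketch assumes that the query, key, and value projections start from the orthonormal DCT-II matrix and that truncation keeps the lowest frequencies along the feature axis of each token. The names `dct_matrix`, `dct_init_qkv`, and `truncate_high_freq`, and the sizes used, are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Return the n x n orthonormal DCT-II matrix C, so that C @ x is the DCT of x."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)    # rescale the DC row so that C @ C.T is the identity
    return c


def dct_init_qkv(dim: int):
    """Hypothetical initializer: start W_q, W_k, and W_v from the DCT matrix."""
    c = dct_matrix(dim)
    return c.copy(), c.copy(), c.copy()


def truncate_high_freq(tokens: np.ndarray, keep: int) -> np.ndarray:
    """Apply the DCT along the feature axis and keep the `keep` lowest frequencies."""
    c = dct_matrix(tokens.shape[-1])
    return tokens @ c[:keep].T  # shape (..., keep)


# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
dim, keep, n_tokens = 64, 32, 49
tokens = rng.standard_normal((n_tokens, dim))

# (i) DCT-based initialization of full-size projections.
w_q, w_k, w_v = dct_init_qkv(dim)

# (ii) Truncation: each token now has `keep` features instead of `dim`.
compressed = truncate_high_freq(tokens, keep)

# A projection on the truncated tokens needs keep x keep weights, not dim x dim.
w_q_small = dct_matrix(keep)
q = compressed @ w_q_small.T

print(compressed.shape, q.shape, w_q.size, w_q_small.size)
```

With these sizes, halving the number of retained coefficients shrinks each square projection matrix from 64 x 64 to 32 x 32 entries, which is the kind of weight-matrix reduction the summary refers to. How much accuracy survives such truncation is an empirical question that this sketch does not address.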

Abstract (original)

Central to the Transformer architectures’ effectiveness is the self-attention mechanism, a function that maps queries, keys, and values into a high-dimensional vector space. However, training the attention weights of queries, keys, and values is non-trivial from a state of random initialization. In this paper, we propose two methods. (i) We first address the initialization problem of Vision Transformers by introducing a simple, yet highly innovative, initialization approach utilizing Discrete Cosine Transform (DCT) coefficients. Our proposed DCT-based attention initialization marks a significant gain compared to traditional initialization strategies; offering a robust foundation for the attention mechanism. Our experiments reveal that the DCT-based initialization enhances the accuracy of Vision Transformers in classification tasks. (ii) We also recognize that since DCT effectively decorrelates image information in the frequency domain, this decorrelation is useful for compression because it allows the quantization step to discard many of the higher-frequency components. Based on this observation, we propose a novel DCT-based compression technique for the attention function of Vision Transformers. Since high-frequency DCT coefficients usually correspond to noise, we truncate the high-frequency DCT components of the input patches. Our DCT-based compression reduces the size of weight matrices for queries, keys, and values. While maintaining the same level of accuracy, our DCT compressed Swin Transformers obtain a considerable decrease in the computational overhead.

arXiv information

Authors	Hongyi Pan, Emadeldeen Hamdan, Xin Zhu, Koushik Biswas, Ahmet Enis Cetin, Ulas Bagci
Published	2024-05-28 17:56:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
