Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

Abstract

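Vision Transformers (ViTs) have recently attracted considerable attention in computer vision.
Their strong representational capacity comes mainly from the self-attention mechanism, which is computationally expensive.
To accelerate ViTs, we propose an integrated compression pipeline based on the heterogeneous attention patterns observed across layers.
On the one hand, different images share more similar attention patterns in early layers than in later layers, which suggests that the dynamic query-by-key self-attention matrix can be replaced with a static self-attention matrix in early layers.
We therefore propose dynamic-guided static self-attention (DGSSA), in which the static matrix inherits self-attention information from the dynamic self-attention it replaces, effectively improving the feature representation ability of ViTs.
On the other hand, attention maps in later layers contain more low-rank patterns, which reflect token redundancy, than those in early layers.
From the viewpoint of linear dimension reduction, we further propose the global aggregation pyramid (GLAD) to reduce the number of tokens in the later layers of ViTs such as DeiT.
Experimentally, the integrated DGSSA and GLAD pipeline improves run-time throughput by up to 121% compared with DeiT, surpassing all SOTA approaches.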
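
To make the idea of swapping a dynamic attention matrix for a static one concrete, here is a minimal NumPy sketch. It is not the authors' DGSSA implementation: the function names, the single-head setup, the random weights, and in particular the initialisation of the static matrix by averaging dynamic attention maps over a batch are all assumptions made for illustration. The abstract only states that the static matrix inherits information from the dynamic attention it replaces.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_attention(x, w_q, w_k, w_v):
    """Standard single-head self-attention: the N x N matrix depends on the input x."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

def static_attention(x, static_logits, w_v):
    """Static variant: the N x N matrix is a parameter shared by every input,
    so the query and key projections and the q @ k.T product are skipped."""
    attn = softmax(static_logits)
    return attn @ (x @ w_v), attn

rng = np.random.default_rng(0)
n_tokens, dim = 8, 16
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

# Assumption for illustration only: initialise the static logits from the mean
# dynamic attention map over a batch of random "images".
images = rng.normal(size=(32, n_tokens, dim))
mean_attn = np.mean([dynamic_attention(x, w_q, w_k, w_v)[1] for x in images], axis=0)
static_logits = np.log(mean_attn)

out, attn = static_attention(images[0], static_logits, w_v)
print(out.shape)  # (8, 16)
```

Because each row of the averaged map is already a probability distribution, applying softmax to its logarithm recovers it exactly, so the static layer starts out reproducing the average behaviour of the dynamic layer it replaces.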

Abstract (original)

Recently, Vision Transformers (ViTs) have attracted a lot of attention in the field of computer vision. Generally, the powerful representative capacity of ViTs mainly benefits from the self-attention mechanism, which has a high computation complexity. To accelerate ViTs, we propose an integrated compression pipeline based on observed heterogeneous attention patterns across layers. On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers. Then, we propose a dynamic-guided static self-attention (DGSSA) method where the matrix inherits self-attention information from the replaced dynamic self-attention to effectively improve the feature representation ability of ViTs. On the other hand, the attention maps have more low-rank patterns, which reflect token redundancy, in later layers than early layers. In a view of linear dimension reduction, we further propose a method of global aggregation pyramid (GLAD) to reduce the number of tokens in later layers of ViTs, such as Deit. Experimentally, the integrated compression pipeline of DGSSA and GLAD can accelerate up to 121% run-time throughput compared with DeiT, which surpasses all SOTA approaches.
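
The token reduction described in the abstract can be pictured as a linear map over the token axis. The sketch below is a hedged illustration, not the GLAD method itself: `aggregate_tokens`, the softmax-normalised aggregation weights, the random parameters, and the 64 → 32 → 16 schedule are all hypothetical choices. It only shows how shrinking the token count in stages reduces the quadratic cost of the attention matrix.

```python
import numpy as np

def aggregate_tokens(x, agg_logits):
    """Reduce N tokens to M with a linear map over the token axis.

    x:          (N, D) token features
    agg_logits: (M, N) hypothetical learnable parameters
    """
    agg_logits = agg_logits - agg_logits.max(axis=-1, keepdims=True)
    weights = np.exp(agg_logits)
    weights /= weights.sum(axis=-1, keepdims=True)  # each output token is a convex mix of inputs
    return weights @ x

rng = np.random.default_rng(0)
dim = 16
x = rng.normal(size=(64, dim))

# Assumed pyramid schedule: 64 -> 32 -> 16 tokens.
stage_sizes = [32, 16]
token_counts = [x.shape[0]]
for m in stage_sizes:
    x = aggregate_tokens(x, rng.normal(size=(m, x.shape[0])))
    token_counts.append(x.shape[0])

# The N x N attention matrix scales with N^2, so cost relative to the first stage is (N / N0)^2.
relative_cost = [(n / token_counts[0]) ** 2 for n in token_counts]
print(token_counts)   # [64, 32, 16]
print(relative_cost)  # [1.0, 0.25, 0.0625]
```

Halving the token count at each stage cuts the size of the attention matrix to a quarter, which is why reducing tokens in the later, more redundant layers is an attractive place to save computation.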

arXiv information

Authors	Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
Published	2023-10-11 17:09:19+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
