Hydra Attention: Efficient Attention with Many Heads

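Summary

Transformers have begun to dominate many vision tasks, but applying them to large images remains computationally difficult.
A major reason is that self-attention scales quadratically with the number of tokens, and the number of tokens in turn scales quadratically with image size.
On larger images (e.g., 1080p), over 60% of the network's total computation is spent solely on creating and applying attention matrices.
We take a step toward solving this problem by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs).
Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count.
Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.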
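
To make the scaling argument concrete, here is a small arithmetic sketch. It assumes 16x16 patches (as the name ViT-B/16 suggests), no class token, and that partial patches at the image border are dropped; the helper names `num_tokens` and `attention_matrix_entries` are hypothetical and exist only for this illustration.

```python
def num_tokens(height, width, patch=16):
    # Assumption: one token per full patch; partial border patches are dropped.
    return (height // patch) * (width // patch)


def attention_matrix_entries(tokens):
    # Standard self-attention builds one tokens x tokens matrix per head.
    return tokens * tokens


for name, (h, w) in {"224x224": (224, 224), "1080p": (1080, 1920)}.items():
    t = num_tokens(h, w)
    print(f"{name}: {t} tokens, {attention_matrix_entries(t)} attention entries per head")
```

Under these assumptions the 1080p input has roughly 41 times as many tokens as the 224x224 input, so each attention matrix has well over a thousand times as many entries.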

Summary (original)

While transformers have begun to dominate many tasks in vision, applying them to large images is still computationally difficult. A large reason for this is that self-attention scales quadratically with the number of tokens, which in turn, scales quadratically with the image size. On larger images (e.g., 1080p), over 60% of the total computation in the network is spent solely on creating and applying attention matrices. We take a step toward solving this issue by introducing Hydra Attention, an extremely efficient attention operation for Vision Transformers (ViTs). Paradoxically, this efficiency comes from taking multi-head attention to its extreme: by using as many attention heads as there are features, Hydra Attention is computationally linear in both tokens and features with no hidden constants, making it significantly faster than standard self-attention in an off-the-shelf ViT-B/16 by a factor of the token count. Moreover, Hydra Attention retains high accuracy on ImageNet and, in some cases, actually improves it.
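
The abstract does not spell out the operation, so the following is only a minimal NumPy sketch of how "as many heads as features" can become linear in tokens. It assumes the softmax is replaced by a decomposable cosine-similarity kernel (L2-normalizing each token's query and key), which is not stated in the text above; the function names `hydra_attention` and `per_feature_heads_naive` are hypothetical. With one feature per head, each head's token-by-token attention matrix is an outer product, so the keys and values can be summed once into a single global vector and then gated by the queries.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Normalize each token's feature vector to unit length.
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / (norm + eps)


def hydra_attention(q, k, v):
    """q, k, v have shape (T, D). Cost is O(T * D); no T x T matrix is built."""
    q = l2_normalize(q)
    k = l2_normalize(k)
    kv = (k * v).sum(axis=0, keepdims=True)  # (1, D): global summary over tokens
    return q * kv                            # (T, D): each token gates the summary


def per_feature_heads_naive(q, k, v):
    """Reference: D heads of one feature each, building a T x T matrix per head."""
    q = l2_normalize(q)
    k = l2_normalize(k)
    out = np.empty_like(v)
    for d in range(q.shape[1]):
        attn = np.outer(q[:, d], k[:, d])    # (T, T) attention for head d
        out[:, d] = attn @ v[:, d]
    return out


rng = np.random.default_rng(0)
T, D = 6, 4
q, k, v = (rng.standard_normal((T, D)) for _ in range(3))
fast = hydra_attention(q, k, v)
slow = per_feature_heads_naive(q, k, v)
print(np.allclose(fast, slow))
```

The naive reference costs O(T^2 * D) while the fused form costs O(T * D), which is consistent with the "factor of the token count" speedup claimed for the attention operation itself.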

arXiv information

Authors	Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Judy Hoffman
Published	2022-09-15 17:27:12+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
