When Attention Sink Emerges in Language Models: An Empirical View

Summary

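Language models (LMs) assign significant attention to the first token even when it is not semantically important. This is known as attention sink.
The phenomenon has been widely adopted in applications such as streaming/long-context generation, KV cache optimization, inference acceleration, and model quantization.
Despite this widespread use, a deep understanding of attention sink in LMs is still lacking.
This work first demonstrates that attention sinks exist universally in LMs across various inputs, even in small models.
Attention sink is also observed to emerge during LM pre-training, which motivates an investigation of how optimization, data distribution, loss function, and model architecture in pre-training influence its emergence.
The authors highlight that attention sink emerges after effective optimization on sufficient training data.
The sink position is highly correlated with the loss function and the data distribution.
Most importantly, attention sink is found to act more like key biases, storing extra attention scores that could be non-informative and may not contribute to the value computation.
The phenomenon is also observed to stem (at least partially) from tokens' inner dependence on attention scores as a result of softmax normalization.
After relaxing this dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs of up to 1B parameters.
The code is available at https://github.com/sail-sg/Attention-Sink.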
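The role of softmax normalization described above can be illustrated with a small sketch. This is not the authors' code or experimental setup: the function names and the toy score matrix are hypothetical, and the example only shows the arithmetic constraint. Under causal softmax attention every row of weights must sum to 1, so a query with no relevant key still has to place its attention somewhere. Sigmoid attention without normalization scores each key independently, so the total can stay near 0.

```python
import math

def softmax_attention(scores):
    """Causal softmax attention weights; each row is normalized to sum to 1."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # causal mask: token i attends to tokens 0..i
        m = max(visible)        # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

def sigmoid_attention(scores):
    """Causal sigmoid attention without normalization; weights are independent."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        gated = [1.0 / (1.0 + math.exp(-s)) for s in visible]
        weights.append(gated + [0.0] * (len(row) - i - 1))
    return weights

# Hypothetical raw scores: every query-key pair is a poor match (all logits low).
scores = [[-4.0] * 4 for _ in range(4)]

soft_row_sums = [sum(r) for r in softmax_attention(scores)]
sig_row_sums = [sum(r) for r in sigmoid_attention(scores)]

print("softmax row sums:", soft_row_sums)  # forced to 1 regardless of the logits
print("sigmoid row sums:", sig_row_sums)   # remain small when nothing matches
```

In the softmax case the mass that has to go somewhere is the "extra attention score" the summary refers to. Where a trained model actually places it is an empirical question that this toy example does not answer.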

Abstract (original)

Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known as attention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate how optimization, data distribution, loss function, and model architecture in LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases, storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens’ inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters. The code is available at https://github.com/sail-sg/Attention-Sink.

arXiv information

Authors	Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin
Published	2024-10-14 17:50:28+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
