Why do LLMs attend to the first token?

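Summary

Large language models (LLMs) tend to attend heavily to the first token in a sequence, creating a so-called attention sink.
Many works have studied this phenomenon in detail and proposed various ways to either leverage or alleviate it.
Attention sinks have been linked to quantisation difficulties, security issues, and streaming attention.
However, while many works have described the conditions under which sinks do or do not occur, a critical question has been answered only superficially: why do LLMs learn such patterns, and how are they used?
In this work, we argue theoretically and empirically that this mechanism gives LLMs a way to avoid over-mixing, and we connect this to existing work that studies mathematically how information propagates in Transformers.
We run experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing affect sink behaviour.
We hope this study offers a new practical perspective on why attention sinks are useful in LLMs and leads to a better understanding of the attention patterns that form during training.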

Abstract (original)

Large Language Models (LLMs) tend to attend heavily to the first token in the sequence — creating a so-called attention sink. Many works have studied this phenomenon in detail, proposing various ways to either leverage or alleviate it. Attention sinks have been connected to quantisation difficulties, security issues, and streaming attention. Yet, while many works have provided conditions in which they occur or not, a critical question remains shallowly answered: Why do LLMs learn such patterns and how are they being used? In this work, we argue theoretically and empirically that this mechanism provides a method for LLMs to avoid over-mixing, connecting this to existing lines of work that study mathematically how information propagates in Transformers. We conduct experiments to validate our theoretical intuitions and show how choices such as context length, depth, and data packing influence the sink behaviour. We hope that this study provides a new practical perspective on why attention sinks are useful in LLMs, leading to a better understanding of the attention patterns that form during training.
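To make the term "attention sink" concrete, the following is a minimal sketch, not the paper's code, of how one might measure the share of attention that lands on the first token under a causal mask. The function names (`causal_softmax`, `first_token_mass`), the toy score matrices, and the +4 logit bonus on the first key are all illustrative assumptions.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax where query i may only see keys 0..i (causal mask)."""
    attn = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Masked (future) positions get exactly zero weight.
        attn.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return attn


def first_token_mass(attn):
    """Mean attention weight on position 0, skipping query 0.

    Query 0 can only see itself, so its weight on position 0 is trivially 1.
    """
    rows = attn[1:]
    return sum(r[0] for r in rows) / len(rows)


n = 6  # toy sequence length (assumption)

# Baseline: every visible key gets the same score.
uniform_scores = [[0.0] * n for _ in range(n)]

# Hypothetical sink: every query gives key 0 a +4 logit bonus.
sink_scores = [[4.0] + [0.0] * (n - 1) for _ in range(n)]

uniform_mass = first_token_mass(causal_softmax(uniform_scores))
sink_mass = first_token_mass(causal_softmax(sink_scores))

print(f"first-token mass, uniform scores: {uniform_mass:.3f}")
print(f"first-token mass, sink scores:    {sink_mass:.3f}")
```

In this toy setting the sink pattern leaves little attention weight for the remaining tokens, which is the behaviour the abstract relates to avoiding over-mixing. The sketch only measures attention weights; it does not model values or training dynamics.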

arXiv information

Authors	Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, Razvan Pascanu
Published	2025-05-13 16:38:34+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

