Efficient Streaming Language Models with Attention Sinks

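Summary

Deploying large language models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed, but it poses two major challenges.
First, during the decoding stage, caching the key and value states (KV) of previous tokens consumes a large amount of memory.
Second, popular LLMs cannot generalize to texts longer than their training sequence length.
Window attention, in which only the most recent KVs are cached, is a natural approach, but the authors show that it fails once the text length exceeds the cache size.
They observe an interesting phenomenon, the attention sink: keeping the KV of the initial tokens largely recovers the performance of window attention.
The paper first shows that attention sinks emerge because the model assigns strong attention scores to the initial tokens, using them as a "sink" even when they are not semantically important.
Building on this analysis, the authors introduce StreamingLLM, an efficient framework that lets LLMs trained with a finite-length attention window generalize to infinite sequence lengths without any fine-tuning.
They show that StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling on up to 4 million tokens and more.
They also find that adding a placeholder token as a dedicated attention sink during pre-training further improves streaming deployment.
In streaming settings, StreamingLLM achieves up to a 22.2x speedup over the sliding window recomputation baseline.
Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.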
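
The cache policy described in the summary (keep the KV of a few initial tokens plus a window of the most recent tokens) can be illustrated with a small sketch. This is not the authors' implementation: the class name `SinkKVCache`, the parameters `n_sink` and `window`, and their default values are hypothetical choices made for illustration, and keys and values are stand-in Python objects rather than tensors. It also ignores attention heads and positional-encoding handling, which a real implementation would need.

```python
class SinkKVCache:
    """Toy KV cache: keeps the first `n_sink` entries plus the latest `window` entries.

    Hypothetical illustration only; names and defaults are assumptions,
    not taken from the StreamingLLM code base.
    """

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink      # number of initial "sink" tokens to keep forever
        self.window = window      # number of most recent tokens to keep
        self.entries = []         # list of (position, key, value) tuples

    def append(self, position, key, value):
        """Add one token's KV, then evict the oldest non-sink entries if over budget."""
        self.entries.append((position, key, value))
        overflow = len(self.entries) - (self.n_sink + self.window)
        if overflow > 0:
            # Entries [0, n_sink) are the sinks; evict right after them.
            del self.entries[self.n_sink:self.n_sink + overflow]

    def positions(self):
        """Return the original token positions currently held in the cache."""
        return [p for p, _, _ in self.entries]


# Stream 10 tokens through a cache with 2 sink slots and a window of 3.
cache = SinkKVCache(n_sink=2, window=3)
for t in range(10):
    cache.append(t, key=f"k{t}", value=f"v{t}")
print(cache.positions())  # [0, 1, 7, 8, 9]
```

Setting `n_sink=0` in this sketch reduces it to plain window attention, the configuration the summary reports as failing once the text outgrows the cache. In both cases the cache size stays bounded at `n_sink + window` entries regardless of how many tokens are streamed.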

Abstract (original)

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens’ Key and Value states (KV) consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length. Window attention, where only the most recent KVs are cached, is a natural approach — but we show that it fails when the text length surpasses the cache size. We observe an interesting phenomenon, namely attention sink, that keeping the KV of initial tokens will largely recover the performance of window attention. In this paper, we first demonstrate that the emergence of attention sink is due to the strong attention scores towards initial tokens as a “sink” even if they are not semantically important. Based on the above analysis, we introduce StreamingLLM, an efficient framework that enables LLMs trained with a finite length attention window to generalize to infinite sequence lengths without any fine-tuning. We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more. In addition, we discover that adding a placeholder token as a dedicated attention sink during pre-training can further improve streaming deployment. In streaming settings, StreamingLLM outperforms the sliding window recomputation baseline by up to 22.2x speedup. Code and datasets are provided at https://github.com/mit-han-lab/streaming-llm.

arXiv information

Authors	Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
Published	2023-09-29 17:59:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
