Recycled Attention: Efficient inference for long-context language models

Abstract

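Generating long token sequences from a long-context input places a heavy computational burden on large language models (LLMs).
One computational bottleneck is computing attention over the long input sequence at every generation step.
In this paper, we propose Recycled Attention, an inference-time method that alternates between attention over the full context and attention over a subset of the input tokens.
During a partial-attention step, we recycle the attention pattern of an earlier token that performed full attention and attend only to the top K most-attended tokens, which reduces the cost of data movement and attention computation.
Compared with previously proposed inference-time acceleration methods, which attend only to the local context or to tokens with high cumulative attention scores, our approach flexibly selects the tokens relevant to the current decoding step.
We evaluate the method on RULER, a suite of tasks designed to comprehensively assess long-context abilities, and on long-context language modeling tasks.
Applied to off-the-shelf LLMs, our method achieves a speedup comparable to baselines that consider only local context while improving performance by 2x.
We further explore two ideas for improving the performance-efficiency trade-off: (1) dynamically deciding, based on query similarity, when to perform a recycled or a full attention step, and (2) continuing to pre-train the model with Recycled Attention.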
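
The alternation between full and recycled attention steps can be illustrated with a minimal single-head sketch. It is not the authors' implementation: the function names, the fixed `stride` schedule, and the choice to ignore keys and values of newly generated tokens are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, indices):
    """Scaled dot-product attention of one query over the cached positions in `indices`."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in indices]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, indices)) for d in range(dim)]
    return output, weights


def top_k_indices(weights, indices, k):
    """Return the k positions with the largest attention weight, in ascending position order."""
    ranked = sorted(zip(weights, indices), key=lambda pair: pair[0], reverse=True)
    return sorted(i for _, i in ranked[:k])


def decode_step(step, query, keys, values, recycled, k, stride):
    """Hypothetical decode step.

    Assumption: a full-attention step runs every `stride` steps (and whenever no
    recycled set exists yet); all other steps attend only to the recycled top-k set.
    """
    all_indices = list(range(len(keys)))
    if recycled is None or step % stride == 0:
        output, weights = attend(query, keys, values, all_indices)
        recycled = top_k_indices(weights, all_indices, k)  # refresh the recycled set
        return output, recycled
    output, _ = attend(query, keys, values, recycled)  # partial attention only
    return output, recycled
```

In this sketch the saving comes from the partial steps touching only `k` cached entries instead of the whole context; with `k` equal to the context length, a recycled step reproduces the full-attention output.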

Abstract (original)

Generating long sequences of tokens given a long-context input imposes a heavy computational burden for large language models (LLMs). One of the computational bottleneck comes from computing attention over a long sequence of input at each generation step. In this paper, we propose Recycled Attention, an inference-time method which alternates between full context attention and attention over a subset of input tokens. When performing partial attention, we recycle the attention pattern of a previous token that has performed full attention and attend only to the top K most attended tokens, reducing the cost of data movement and attention computation. Compared to previously proposed inference-time acceleration method which attends only to local context or tokens with high accumulative attention scores, our approach flexibly chooses tokens that are relevant to the current decoding step. We evaluate our methods on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and long-context language modeling tasks. Applying our method to off-the-shelf LLMs achieves comparable speedup to baselines which only consider local context while improving the performance by 2x. We further explore two ideas to improve performance-efficiency trade-offs: (1) dynamically decide when to perform recycled or full attention step based on the query similarities and (2) continued pre-training the model with Recycled Attention.
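
The first of the two ideas, deciding dynamically between a recycled and a full attention step based on query similarity, might look like the sketch below. The abstract does not specify the similarity measure or the decision rule, so the use of cosine similarity, the `threshold` value, and the function names are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_run_full_attention(query, last_full_query, threshold=0.8):
    """Hypothetical rule: run full attention when the current query has drifted
    away from the query of the most recent full-attention step.

    `threshold` is an illustrative value, not one taken from the paper.
    """
    if last_full_query is None:
        return True  # nothing to recycle yet
    return cosine_similarity(query, last_full_query) < threshold
```

The intuition behind such a rule is that a query similar to the one that produced the recycled attention pattern is likely to attend to similar tokens, so reusing that pattern is more likely to be safe.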

arXiv information

Authors	Fangyuan Xu, Tanya Goyal, Eunsol Choi
Published	2024-11-08 18:57:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
