AttentionPredictor: Temporal Pattern Matters for Efficient LLM Inference

Abstract

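With the development of large language models (LLMs), efficient inference through key-value (KV) cache compression has attracted considerable attention, especially for long-context generation.
To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking of attention scores.
However, these methods neglect the \textit{temporal patterns} in attention scores, so they struggle to identify critical tokens accurately, which noticeably degrades LLM performance.
To address this challenge, we propose AttentionPredictor, the first learning-based approach to critical token identification.
Specifically, AttentionPredictor learns a lightweight convolution model that captures spatiotemporal patterns and predicts the next-token attention score.
An appealing feature of AttentionPredictor is that it predicts attention scores accurately while consuming negligible memory.
Moreover, we propose a cross-token critical cache prefetching framework that hides the time overhead of token estimation and accelerates the decoding stage.
By retaining most of the attention information, AttentionPredictor achieves 16$\times$ KV cache compression with comparable LLM performance, significantly outperforming the state of the art.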
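
To make the idea concrete, the following is a minimal sketch of predicting next-step attention from a short history of attention rows and then keeping only the highest-scoring tokens under a 16x budget. It is not the authors' implementation: the function names `predict_next_scores` and `select_critical_tokens`, the hand-set kernel (a stand-in for learned convolution weights), and the plain top-k selection rule are all assumptions made for illustration.

```python
from typing import List


def predict_next_scores(history: List[List[float]],
                        kernel: List[List[float]]) -> List[float]:
    """Apply one 2D convolution kernel over a (time, position) attention history.

    history: T rows (oldest first), each holding N scores over the cached tokens.
    kernel:  T rows, each with an odd number of weights. The weights are
             hand-set here (assumption); a learned model would fit them.
    """
    n = len(history[0])
    half = len(kernel[0]) // 2
    predicted = []
    for pos in range(n):
        total = 0.0
        for t, row in enumerate(kernel):
            for k, weight in enumerate(row):
                src = pos + k - half
                if 0 <= src < n:  # zero padding at the sequence edges
                    total += weight * history[t][src]
        predicted.append(total)
    return predicted


def select_critical_tokens(scores: List[float], compression: int) -> List[int]:
    """Keep the top len(scores) // compression positions (hypothetical rule)."""
    budget = max(1, len(scores) // compression)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy history: 3 decoding steps over 32 cached tokens, with two tokens
# that receive attention consistently at every step.
history = [[0.0] * 32 for _ in range(3)]
for row in history:
    row[5] = 0.5
    row[20] = 0.3

# Weight recent steps more heavily; no mixing across neighboring positions.
kernel = [[0.0, 0.2, 0.0],
          [0.0, 0.3, 0.0],
          [0.0, 0.5, 0.0]]

scores = predict_next_scores(history, kernel)
print(select_critical_tokens(scores, compression=16))  # -> [5, 20]
```

With 32 cached tokens and a compression factor of 16, the budget is 2 tokens, so only the two consistently attended positions are kept. A real system would predict scores per attention head and retrieve the selected KV entries ahead of the next decoding step.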

Abstract (original)

With the development of large language models (LLMs), efficient inference through Key-Value (KV) cache compression has attracted considerable attention, especially for long-context generation. To compress the KV cache, recent methods identify critical KV tokens through heuristic ranking with attention scores. However, these methods often struggle to accurately determine critical tokens as they neglect the \textit{temporal patterns} in attention scores, resulting in a noticeable degradation in LLM performance. To address this challenge, we propose AttentionPredictor, which is the first learning-based critical token identification approach. Specifically, AttentionPredictor learns a lightweight convolution model to capture spatiotemporal patterns and predict the next-token attention score. An appealing feature of AttentionPredictor is that it accurately predicts the attention score while consuming negligible memory. Moreover, we propose a cross-token critical cache prefetching framework that hides the token estimation time overhead to accelerate the decoding stage. By retaining most of the attention information, AttentionPredictor achieves 16$\times$ KV cache compression with comparable LLM performance, significantly outperforming the state-of-the-art.

arXiv information

Authors	Qingyue Yang, Jie Wang, Xing Li, Zhihai Wang, Chen Chen, Lei Chen, Xianzhi Yu, Wulong Liu, Jianye Hao, Mingxuan Yuan, Bin Li
Published	2025-02-06 13:41:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
