KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments

Abstract

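We demonstrate that keys that are geometrically distinctive during LLM inference tend to have high attention scores.
Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity.
Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and generate responses efficiently.
We provide a theoretical basis for KeyDiff by relating key diversity to attention scores.
These results imply that KeyDiff can efficiently identify the most important tokens to retain.
Notably, KeyDiff does not rely on attention scores, which allows the use of optimized attention mechanisms such as FlashAttention.
Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with an 8K cache budget ($\sim$23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B.
We also observe near-baseline performance for DeepSeek-R1-Distill-Llama-8B on the Math500 reasoning benchmark, and we decrease end-to-end inference latency by up to 30% compared to the other token-eviction methods.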
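
The abstract describes eviction driven only by key similarity, with geometrically distinctive keys being the ones worth keeping. A minimal sketch of that idea follows. The abstract does not spell out the scoring rule, so this sketch assumes one plausible choice: score each key by its cosine similarity to the mean of the unit-normalized keys, and keep the `budget` keys with the lowest similarity. The function names (`key_similarity_scores`, `evict_by_key_similarity`) are hypothetical and not taken from the paper or any library.

```python
import math


def normalize(vec):
    """Return vec scaled to unit length (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def key_similarity_scores(keys):
    """Cosine similarity of each key to an anchor direction.

    Assumption: the anchor is the mean of the unit-normalized keys.
    A low score marks a key as geometrically distinctive.
    """
    unit_keys = [normalize(k) for k in keys]
    dim = len(unit_keys[0])
    anchor = normalize(
        [sum(u[d] for u in unit_keys) / len(unit_keys) for d in range(dim)]
    )
    return [sum(a * b for a, b in zip(u, anchor)) for u in unit_keys]


def evict_by_key_similarity(keys, values, budget):
    """Keep at most `budget` (key, value) pairs, preferring distinctive keys.

    Returns the retained keys, retained values, and their original
    positions in ascending order. No attention scores are needed.
    """
    if len(keys) <= budget:
        return list(keys), list(values), list(range(len(keys)))
    scores = key_similarity_scores(keys)
    # Lowest similarity first, then restore positional order.
    keep = sorted(sorted(range(len(keys)), key=lambda i: scores[i])[:budget])
    return [keys[i] for i in keep], [values[i] for i in keep], keep


if __name__ == "__main__":
    toy_keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
    toy_values = ["v0", "v1", "v2", "v3"]
    _, kept_values, kept_positions = evict_by_key_similarity(
        toy_keys, toy_values, budget=2
    )
    print(kept_positions, kept_values)
```

Because the score depends only on the cached keys, a rule like this can run alongside a fused attention kernel that never materializes the attention matrix, which is the property the abstract highlights. In a real model the selection would be applied per layer and per KV head; the sketch uses a single list of keys for brevity.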

Abstract (original)

We demonstrate that geometrically distinctive keys during LLM inference tend to have high attention scores. Based on this phenomenon, we propose KeyDiff, a training-free KV cache eviction method based solely on key similarity. Unlike other KV cache eviction methods, KeyDiff can process arbitrarily long prompts within strict resource constraints and efficiently generate responses. We provide a theoretical basis for KeyDiff by relating key diversity with attention scores. These results imply KeyDiff can efficiently identify the most important tokens to retain. Notably, KeyDiff does not rely on attention scores, allowing the use of optimized attention mechanisms like FlashAttention. Under a strict memory allowance, we demonstrate the effectiveness of KeyDiff for the Llama and Qwen model families by observing a performance gap of less than 0.04% with 8K cache budget ($\sim$23% KV cache reduction) from the non-evicting baseline on LongBench for Llama 3.1-8B and Llama 3.2-3B. We also observe near-baseline performance for DeepSeek-R1-Distill-Llama-8B on the Math500 reasoning benchmark and decrease end-to-end inference latency by up to 30% compared to the other token-eviction methods.
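
To give a sense of what an 8K cache budget means in memory terms, the following sketch computes the KV cache size for a fixed token budget. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit elements) is an assumption based on the publicly released Llama 3.1-8B configuration, not a figure stated in the abstract, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    The leading factor of 2 accounts for storing one key vector and one
    value vector per token, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens


if __name__ == "__main__":
    # Assumed Llama 3.1-8B shape; fp16/bf16 cache entries.
    budget_tokens = 8192
    total = kv_cache_bytes(budget_tokens, num_layers=32, num_kv_heads=8, head_dim=128)
    print(f"{total / 2**30:.2f} GiB for a {budget_tokens}-token budget")
```

With a fixed token budget, this footprint stays constant no matter how long the prompt grows, which is what makes an eviction policy attractive under a strict memory allowance.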

arXiv information

Authors	Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott
Published	2025-05-20 17:50:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
