WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference

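Abstract

With advances in the long-context inference capabilities of large language models (LLMs), the KV cache has become one of their foundational components.
However, it consumes a substantial amount of GPU memory, which makes KV cache compression a key technique for efficient LLM inference in industrial scenarios.
Recent work has focused on reducing the memory occupied by the KV cache but overlooks two important factors: preserving semantic coherence and taking task-specific characteristics into account during compression.
To address these limitations, we propose WindowKV, a novel task-adaptive KV cache window selection method.
WindowKV dynamically selects local semantic windows made up of consecutive tokens according to task-specific characteristics, so that the retained KV cache captures continuous, essential context.
In addition, we introduce an intra-group layer KV cache index sharing strategy that reduces computational overhead and balances performance against efficiency.
We rigorously evaluate WindowKV on the LongBench benchmark; the results show that it maintains performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements.
Furthermore, our method achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.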
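
The two mechanisms named in the abstract, selecting whole windows of consecutive tokens and sharing the selected indices among the layers of a group, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the function names `select_windows` and `share_indices_across_group`, the use of non-overlapping windows, ranking windows by their mean per-token score, and letting the first layer of each group compute the indices. The task-adaptive part of the method is not modeled; the per-token scores and the budget are simply taken as inputs.

```python
from typing import List, Sequence


def select_windows(scores: Sequence[float], window_size: int, budget: int) -> List[int]:
    """Pick whole windows of consecutive tokens until `budget` tokens are kept.

    Assumption: windows are non-overlapping and ranked by mean token score.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    n = len(scores)
    # Split token positions into contiguous, non-overlapping windows.
    windows = [range(s, min(s + window_size, n)) for s in range(0, n, window_size)]
    # Highest mean score first; sorted() is stable, so ties keep the earlier window.
    ranked = sorted(windows, key=lambda w: -sum(scores[i] for i in w) / len(w))
    kept: List[int] = []
    for w in ranked:
        if len(kept) + len(w) > budget:
            continue  # this window does not fit in the remaining budget
        kept.extend(w)
    return sorted(kept)


def share_indices_across_group(
    layer_scores: Sequence[Sequence[float]],
    group_size: int,
    window_size: int,
    budget: int,
) -> List[List[int]]:
    """Compute indices once per group of layers and reuse them inside the group.

    Assumption: the first layer of each group is the one that runs selection.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    per_layer: List[List[int]] = []
    for first in range(0, len(layer_scores), group_size):
        shared = select_windows(layer_scores[first], window_size, budget)
        group_len = min(group_size, len(layer_scores) - first)
        # Every layer in the group points at the same index list.
        per_layer.extend([shared] * group_len)
    return per_layer


# Toy example: 8 tokens, windows of 2 tokens, keep 4 tokens.
toy_scores = [0.1, 0.2, 0.9, 0.8, 0.1, 0.0, 0.7, 0.6]
print(select_windows(toy_scores, window_size=2, budget=4))  # [2, 3, 6, 7]
```

In this sketch selection runs once per group instead of once per layer, which is where the reduction in computational overhead would come from; the kept positions always form runs of consecutive tokens rather than isolated ones.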

Abstract (original)

With the advancements in long-context inference capabilities of large language models (LLMs), the KV cache has become one of the foundational components. However, its substantial GPU memory consumption makes KV cache compression a key technique for enabling efficient LLM inference in industrial scenarios. While recent studies have focused on optimizing the memory occupied by the KV cache, they overlook two critical factors: preserving semantic coherence and considering task-specific characteristics during compression. To address these limitations, we propose a novel task-adaptive KV cache window selection method, WindowKV. WindowKV dynamically selects local semantic windows consisting of consecutive tokens, according to task-specific characteristics, ensuring the retained KV cache captures continuous, essential context. Additionally, we introduce an intra-group layer KV cache indices sharing strategy to reduce computational overhead, achieving a balance between performance and efficiency. We rigorously evaluate WindowKV on the LongBench benchmark, and the results demonstrate that it maintains a performance comparable to full KV cache retention while using only 12% of the original KV cache, significantly reducing memory requirements. Furthermore, our method also achieves state-of-the-art results in the Needle-in-a-Haystack evaluation, highlighting its effectiveness and robustness.

arXiv information

Authors	Youhui Zuo, Sibo Wei, Chen Zhang, Zhuorui Liu, Wenpeng Lu, Dawei Song
Published	2025-03-27 14:11:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
