DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance

Summary

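To reduce the memory burden of large language model (LLM) inference, many studies compress the KV cache by exploiting properties such as attention sparsity.
These techniques are usually designed around a predefined KV budget.
However, the optimal budget varies with input length and task type, so a fixed budget can yield inconsistent performance on inputs from diverse domains.
To address this limitation, the authors propose a new KV cache compression objective: always preserve full-cache performance regardless of the input, while pruning as much of the KV cache as possible.
To meet this objective, they introduce DBudgetKV, a KV cache compression method whose attention-based metric signals when the remaining KV cache is unlikely to match full-cache performance, at which point pruning halts.
Empirical evaluation across diverse context lengths, task types, and model sizes suggests that the method achieves lossless KV pruning effectively and robustly, exceeding a 25% compression ratio on average.
The method is also easy to integrate into LLM inference: it not only optimizes memory use but also shows reduced inference time compared with existing methods.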

Abstract (original)

To alleviate memory burden during inference of large language models (LLMs), numerous studies have focused on compressing the KV cache by exploring aspects such as attention sparsity. These techniques are often designed with a pre-defined KV budget; however, as the optimal budget varies by different input lengths and task types, the existence of a fixed budget could result in inconsistent performance accepting inputs of diverse domains. To address this limitation, we propose a new KV cache compression objective: to always ensure the full-cache performance regardless of specific inputs, while maximizing KV cache pruning as much as possible. To achieve this goal, we introduce a novel KV cache compression method dubbed DBudgetKV, which features an attention-based metric to signal when the remaining KV cache is unlikely to match the full-cache performance, then halting the pruning process. Empirical evaluation spanning diverse context lengths, task types, and model sizes suggests that our method achieves lossless KV pruning effectively and robustly, exceeding 25% compression ratio on average. Furthermore, our method is easy to integrate within LLM inference, not only optimizing memory space, but also showing reduced inference time compared to existing methods.
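
The contrast between a fixed budget and an input-dependent stopping point can be illustrated with a small sketch. The abstract does not specify the attention-based metric, so the rule below is a stand-in chosen for illustration only: it drops the least-attended entries one at a time and halts just before the retained share of total attention would fall below a threshold. The names `dynamic_budget_prune`, `pruned_fraction`, and `min_retained_mass` are hypothetical and are not taken from the paper.

```python
from typing import List


def dynamic_budget_prune(attn_scores: List[float],
                         min_retained_mass: float = 0.95) -> List[int]:
    """Return the indices of the KV entries to keep.

    Assumption (not the paper's metric): the halting signal is the share of
    total attention mass still covered by the retained entries.
    """
    total = sum(attn_scores)
    if total <= 0:
        # No usable signal: keep everything rather than prune blindly.
        return list(range(len(attn_scores)))

    # Visit candidates from least to most attended.
    order = sorted(range(len(attn_scores)), key=lambda i: attn_scores[i])
    retained = total
    pruned = set()
    for idx in order:
        if (retained - attn_scores[idx]) / total < min_retained_mass:
            break  # halting signal: pruning further would cross the threshold
        retained -= attn_scores[idx]
        pruned.add(idx)
    return [i for i in range(len(attn_scores)) if i not in pruned]


def pruned_fraction(n_total: int, n_kept: int) -> float:
    """Share of KV entries removed (0.0 means nothing was pruned)."""
    return 0.0 if n_total == 0 else 1.0 - n_kept / n_total


# Peaked attention: a few entries dominate, so some can be dropped.
peaked = [50.0, 30.0, 10.0, 5.0, 3.0, 2.0]
# Flat attention: every entry matters equally, so pruning halts at once.
flat = [1.0] * 10

kept_peaked = dynamic_budget_prune(peaked, min_retained_mass=0.92)
kept_flat = dynamic_budget_prune(flat, min_retained_mass=0.92)
print(kept_peaked, pruned_fraction(len(peaked), len(kept_peaked)))
print(kept_flat, pruned_fraction(len(flat), len(kept_flat)))
```

With the same threshold, the peaked input ends up with a smaller budget than the flat one, which keeps every entry. That is the behavior a single predefined budget cannot express; the actual method may decide this per layer or per head with a different signal.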

arXiv information

Authors	Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li
Published	2025-06-09 15:31:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
