Q-Filters: Leveraging QK Geometry for Efficient KV Cache Compression

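Summary

Autoregressive language models rely on a key-value (KV) cache, which avoids recomputing past hidden states during generation and thereby speeds it up.
As model sizes and context lengths grow, the KV cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation.
This paper reports surprising properties of query (Q) and key (K) vectors that make it possible to approximate attention scores efficiently without computing the attention maps.
It proposes Q-Filters, a training-free KV cache compression method that filters out less important key-value pairs based on a single context-agnostic projection.
Unlike many alternatives, Q-Filters does not require direct access to attention weights, so it is compatible with FlashAttention.
Experiments in long-context settings show that Q-Filters is competitive with attention-based compression methods such as SnapKV on retrieval tasks, while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups.
Notably, Q-Filters achieves 99% accuracy on the needle-in-a-haystack task at a x32 compression level, and reduces the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.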
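
The idea of pruning KV pairs with a single context-agnostic projection can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the projection is obtained, so the sketch assumes it is the principal direction of a sample of query vectors, estimated by power iteration and oriented so that queries project positively on it. Under that assumption, a key's projection on the direction is proportional to its expected dot product with a query, so no attention map is needed. The names `estimate_filter` and `compress_kv` and all the data are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def estimate_filter(queries, iters=50, seed=0):
    """Hypothetical helper: principal direction of sample queries.

    Runs power iteration on Q^T Q (an assumption of this sketch, not a
    detail taken from the abstract), then flips the sign so that the
    queries project positively on the direction on average.
    """
    rng = random.Random(seed)
    dim = len(queries[0])
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for _ in range(iters):
        proj = [dot(q, u) for q in queries]            # Q u
        v = [sum(p * q[d] for p, q in zip(proj, queries))
             for d in range(dim)]                       # Q^T (Q u)
        norm = math.sqrt(dot(v, v))
        u = [x / norm for x in v]
    if sum(dot(q, u) for q in queries) < 0:
        u = [-x for x in u]
    return u


def compress_kv(keys, values, direction, keep):
    """Keep the `keep` KV pairs whose keys project highest on `direction`.

    Scoring uses only the keys and one fixed vector, so attention
    weights are never computed. Original token order is preserved.
    """
    scores = [dot(k, direction) for k in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [keys[i] for i in kept], [values[i] for i in kept], kept


# Synthetic data for one attention head (illustrative only).
rng = random.Random(0)
dim = 8
# Queries concentrated along the first axis, plus noise.
queries = [[3.0 + rng.gauss(0, 0.5)] + [rng.gauss(0, 0.5) for _ in range(dim - 1)]
           for _ in range(64)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(32)]
values = [f"v{i}" for i in range(len(keys))]

direction = estimate_filter(queries)       # computed once, reused for any context
kept_keys, kept_values, kept_idx = compress_kv(keys, values, direction, keep=len(keys) // 8)
print(kept_idx, kept_values)
```

Here the direction is estimated once from sample queries and then reused for every context, which is what makes the projection context-agnostic in this sketch. The x8 ratio in the example is arbitrary.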

Abstract (original)

Autoregressive language models rely on a Key-Value (KV) Cache, which avoids re-computing past hidden states during generation, making it faster. As model sizes and context lengths grow, the KV Cache becomes a significant memory bottleneck, which calls for compression methods that limit its size during generation. In this paper, we discover surprising properties of Query (Q) and Key (K) vectors that allow us to efficiently approximate attention scores without computing the attention maps. We propose Q-Filters, a training-free KV Cache compression method that filters out less crucial Key-Value pairs based on a single context-agnostic projection. Contrarily to many alternatives, Q-Filters is compatible with FlashAttention, as it does not require direct access to attention weights. Experimental results in long-context settings demonstrate that Q-Filters is competitive with attention-based compression methods such as SnapKV in retrieval tasks while consistently outperforming efficient compression schemes such as Streaming-LLM in generation setups. Notably, Q-Filters achieves a 99% accuracy in the needle-in-a-haystack task with a x32 compression level while reducing the generation perplexity drop by up to 65% in text generation compared to Streaming-LLM.
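
To make the memory bottleneck concrete, the size of a dense KV cache can be estimated with simple arithmetic: two tensors (keys and values) per layer, each holding one vector of size head_dim per KV head per token. The sketch below applies that formula to a hypothetical model configuration; the layer count, head count, head dimension and context length are illustrative assumptions, not figures from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Size in bytes of a dense KV cache.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to 16-bit floats.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


# Hypothetical configuration (assumed for illustration).
layers, kv_heads, head_dim = 32, 8, 128
context = 131072  # 128k tokens

full = kv_cache_bytes(layers, kv_heads, head_dim, context)
# Keeping 1 of every 32 KV pairs, as in a x32 compression level.
compressed = kv_cache_bytes(layers, kv_heads, head_dim, context // 32)

print(f"full cache:       {full / 2**30:.2f} GiB")
print(f"x32 compressed:   {compressed / 2**30:.2f} GiB")
```

For this assumed configuration the full cache is 16 GiB per sequence and the x32-compressed cache is 0.5 GiB, since cache size scales linearly with the number of retained tokens.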

arXiv information

Authors	Nathan Godey, Alessio Devoto, Yu Zhao, Simone Scardapane, Pasquale Minervini, Éric de la Clergerie, Benoît Sagot
Published	2025-03-04 17:37:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
