Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models

Abstract

Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning or merging, but these techniques often compromise quality towards higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) high-compression mechanisms for internal network states. We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to ‘optimally’ compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under $1\%$ relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
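
The core idea in the abstract, predicting one part of the cache from another and quantizing only what the predictor misses, can be illustrated with a small sketch. This is not the authors' implementation: the least-squares linear predictor, the per-row uniform min-max quantizer, the synthetic "keys", and all names (`quantize_dequantize`, `fit_linear_predictor`, `run_demo`) are assumptions chosen for illustration.

```python
import numpy as np


def quantize_dequantize(x: np.ndarray, bits: int) -> np.ndarray:
    """Uniform min-max quantization per row; returns the dequantized values."""
    levels = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    # Guard against constant rows, where the range would be zero.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale)  # integer codes in [0, levels]
    return codes * scale + lo


def fit_linear_predictor(prev_keys: np.ndarray, cur_keys: np.ndarray) -> np.ndarray:
    """Least-squares weights w such that prev_keys @ w approximates cur_keys."""
    w, *_ = np.linalg.lstsq(prev_keys, cur_keys, rcond=None)
    return w


def run_demo(bits: int = 2, seed: int = 0) -> tuple[float, float]:
    """Compare direct quantization with quantizing the prediction residual."""
    rng = np.random.default_rng(seed)
    n, d = 1024, 16

    # Synthetic stand-in for keys of two adjacent layers: the second layer is
    # mostly a linear function of the first, plus a small unpredictable part.
    prev_keys = rng.standard_normal((n, d))
    mixing = rng.standard_normal((d, d))
    cur_keys = prev_keys @ mixing + 0.1 * rng.standard_normal((n, d))

    # Fit the predictor on a calibration split, evaluate on held-out rows.
    calib, test = slice(0, n // 2), slice(n // 2, n)
    w = fit_linear_predictor(prev_keys[calib], cur_keys[calib])

    # Baseline: quantize the current layer's keys directly.
    direct = quantize_dequantize(cur_keys[test], bits)

    # Predict, then quantize only the residual the predictor cannot explain.
    predicted = prev_keys[test] @ w
    residual = cur_keys[test] - predicted
    reconstructed = predicted + quantize_dequantize(residual, bits)

    mse_direct = float(np.mean((cur_keys[test] - direct) ** 2))
    mse_residual = float(np.mean((cur_keys[test] - reconstructed) ** 2))
    return mse_direct, mse_residual


if __name__ == "__main__":
    mse_direct, mse_residual = run_demo()
    print(f"direct MSE:   {mse_direct:.4f}")
    print(f"residual MSE: {mse_residual:.4f}")
```

In this toy setting the residual has a much smaller range than the raw vectors, so the same bit budget yields a lower reconstruction error. A real cache would feed the predictor with already-reconstructed (dequantized) inputs and would likely use a stronger quantizer than min-max rounding.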

arXiv information

Authors	Alina Shutova, Vladimir Malinovskii, Vage Egiazarian, Denis Kuznedelev, Denis Mazur, Nikita Surkov, Ivan Ermakov, Dan Alistarh
Published	2025-01-31 18:47:42+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
