BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference

Abstract

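Large language models (LLMs) are essential to natural language processing, but they often struggle with inference speed and computational efficiency, which limits real-time deployment.
The key-value (KV) cache mechanism reduces computational overhead in transformer models, but maintaining contextual understanding remains a challenge.
In this paper, we propose BUZZ, a new KV caching algorithm that uses structured contextual information to minimize cache memory usage while improving inference speed.
BUZZ employs a beehive-structured sparse cache: it incorporates a sliding window to capture recent information and dynamically segments historical tokens into chunks, prioritizing the important tokens within each local neighborhood.
We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA.
Our results show that BUZZ (1) reduces cache memory usage in LLM inference by $\textbf{2.5}\times$ while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69%}$ under the same memory limit, a setting in which full-cache methods run out of memory.
In addition, BUZZ achieves a significant inference speedup with $\log{n}$ time complexity.
The code is available at https://github.com/JunqiZhao888/buzz-llm.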
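
To make the cache layout described in the abstract more concrete, here is a minimal Python sketch of one way a sliding window could be combined with per-chunk heavy-hitter selection. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_kept_indices`, the parameters `window` and `chunk`, the use of a single importance score per token, and the choice of keeping exactly one token per chunk are all hypothetical and may differ from the released code.

```python
from typing import List, Sequence


def select_kept_indices(scores: Sequence[float], window: int, chunk: int) -> List[int]:
    """Return the sorted cache positions to keep.

    Assumptions (not taken from the paper): `scores[i]` is some accumulated
    importance score for cached token i, the last `window` tokens are always
    kept, and each older chunk of `chunk` tokens keeps only its top-scoring one.
    """
    if window < 0 or chunk < 1:
        raise ValueError("window must be >= 0 and chunk must be >= 1")
    n = len(scores)
    split = max(0, n - window)  # positions [split, n) form the sliding window
    kept: List[int] = []
    # Historical part: keep one local heavy hitter per contiguous chunk.
    for start in range(0, split, chunk):
        end = min(start + chunk, split)
        best = max(range(start, end), key=lambda i: scores[i])
        kept.append(best)
    # Recent part: keep every token inside the sliding window.
    kept.extend(range(split, n))
    return kept


scores = [0.1, 0.9, 0.2, 0.3, 0.05, 0.4, 0.7, 0.6, 0.5, 0.8]
kept = select_kept_indices(scores, window=3, chunk=3)
print(kept)  # [1, 5, 6, 7, 8, 9]
```

In this toy run the ten cached positions shrink to six: one representative from each of the three historical chunks plus the three most recent tokens. Selecting locally within each chunk, rather than taking a global top-k, is what keeps the retained tokens spread across the whole history.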

Abstract (original)

Large language models (LLMs) are essential in natural language processing but often struggle with inference speed and computational efficiency, limiting real-time deployment. The key-value (KV) cache mechanism reduces computational overhead in transformer models, but challenges in maintaining contextual understanding remain. In this paper, we propose BUZZ, a novel KV caching algorithm that leverages structured contextual information to minimize cache memory usage while enhancing inference speed. BUZZ employs a beehive-structured sparse cache, incorporating a sliding window to capture recent information and dynamically segmenting historical tokens into chunks to prioritize important tokens in local neighborhoods. We evaluate BUZZ on four real-world datasets: CNN/Daily Mail, XSUM, Wikitext, and 10-QA. Our results demonstrate that BUZZ (1) reduces cache memory usage by $\textbf{2.5}\times$ in LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses state-of-the-art performance in multi-document question answering by $\textbf{7.69%}$ under the same memory limit, where full cache methods encounter out-of-memory issues. Additionally, BUZZ achieves significant inference speedup with a $\log{n}$ time complexity. The code is available at https://github.com/JunqiZhao888/buzz-llm.

arXiv information

Authors	Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He
Published	2024-10-30 14:53:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
