Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Summary

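Large language models perform well across many domains, but the Key-Value (KV) cache required for long-sequence inference keeps growing, which strains both memory and time efficiency.
Recent work reduces the KV cache to a given memory budget by evicting large numbers of non-critical cache elements at runtime while preserving generation quality.
Revisiting current eviction methods, we find that they fundamentally minimize an upper bound on the $L_1$ eviction loss, that is, the $L_1$ distance between the pre-eviction and post-eviction outputs of the multi-head self-attention mechanism.
Our analysis further indicates that the common practice of assigning the budget uniformly across attention heads harms post-eviction generation quality.
In light of these findings, we propose a simple yet effective adaptive budget allocation algorithm.
The algorithm not only optimizes the theoretical upper bound on the loss but also reduces the actual $L_1$ eviction loss by adapting to the differing characteristics of individual heads.
By integrating the algorithm into two state-of-the-art methods, we demonstrate that adaptive budget allocation is effective for optimizing KV cache eviction.
Extensive evaluation on 16 datasets and the Needle-in-a-Haystack test confirms significant performance improvements across a variety of tasks.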
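
The abstract does not spell out the allocation rule, so the following is only a minimal sketch under an assumed rule: pool the attention weights of all heads, keep the globally largest ones up to the total budget, and give each head as many slots as it has survivors. The function names (`uniform_budgets`, `adaptive_budgets`, `evicted_mass`) and the toy weights are hypothetical, and the evicted attention mass is used here only as an illustrative stand-in for the eviction loss bound, not as the paper's exact quantity.

```python
from heapq import nlargest


def uniform_budgets(num_heads, total_budget):
    """Baseline: every head keeps the same number of cache entries."""
    return [total_budget // num_heads] * num_heads


def adaptive_budgets(weights, total_budget):
    """Assumed adaptive rule: pool all heads' attention weights, keep the
    global top `total_budget`, and count the survivors owned by each head."""
    pooled = [(w, h) for h, row in enumerate(weights) for w in row]
    budgets = [0] * len(weights)
    for _, head in nlargest(total_budget, pooled, key=lambda p: p[0]):
        budgets[head] += 1
    return budgets


def evicted_mass(weights, budgets):
    """Total attention mass dropped when head h keeps its top budgets[h] weights."""
    total = 0.0
    for row, b in zip(weights, budgets):
        kept = sum(sorted(row, reverse=True)[:b])
        total += sum(row) - kept
    return total


if __name__ == "__main__":
    # Toy attention weights: head 0 is concentrated, head 1 is dispersed.
    weights = [
        [0.90, 0.05, 0.03, 0.02],
        [0.30, 0.25, 0.25, 0.20],
    ]
    total_budget = 4

    uni = uniform_budgets(len(weights), total_budget)
    ada = adaptive_budgets(weights, total_budget)
    print("uniform :", uni, "evicted mass:", round(evicted_mass(weights, uni), 3))
    print("adaptive:", ada, "evicted mass:", round(evicted_mass(weights, ada), 3))
```

In this toy case the uniform split keeps two entries per head, while the assumed adaptive rule shifts one slot from the concentrated head to the dispersed head, and the evicted attention mass falls from about 0.5 to about 0.3. Real attention patterns and the paper's actual procedure may differ from this sketch.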

Summary (original)

Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency due to the expanding Key-Value (KV) cache required for long-sequence inference. Recent efforts try to reduce KV cache size to a given memory budget by evicting vast non-critical cache elements during runtime, while preserving generation quality. Our revisiting of current eviction methods reveals that they fundamentally minimize an upper bound of the $L_1$ eviction loss between the pre- and post-eviction outputs of multi-head self-attention mechanisms. Moreover, our analysis indicates that the common practices of uniformly assigning budgets across attention heads harm their post-eviction generation quality. In light of these findings, we propose a simple yet effective adaptive budget allocation algorithm. This algorithm not only optimizes the theoretical loss upper bound but also reduces the $L_1$ eviction loss in practice by aligning with the varied characteristics across different heads. By integrating this algorithm into two state-of-the-art methods, we demonstrate the effectiveness of using adaptive budget allocation to optimize KV cache eviction. Extensive evaluations on 16 datasets and the Needle-in-a-Haystack test confirm significant performance improvements across various tasks.

arXiv information

Authors	Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou
Published	2024-08-16 08:46:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
