A Survey on Large Language Model Acceleration based on KV Cache Management

Abstract

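Large language models (LLMs) have revolutionized a wide range of domains, including natural language processing, computer vision, and multimodal tasks, thanks to their ability to understand context and perform logical reasoning.
However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, real-time applications.
Key-Value (KV) cache management has emerged as a key optimization technique for accelerating LLM inference by reducing redundant computation and improving memory utilization.
This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations.
Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms that enhance KV reuse.
System-level approaches address memory management, scheduling, and hardware-aware design to improve efficiency across diverse computing environments.
The survey also gives an overview of the text and multimodal datasets and benchmarks used to evaluate these strategies.
By presenting detailed taxonomies and comparative analyses, this work aims to offer researchers and practitioners useful insights that support the development of efficient, scalable KV cache management techniques and contribute to the practical deployment of LLMs in real-world applications.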
The curated paper list for KV cache management is available at: https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.

Abstract (original)

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is in: https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.
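
The abstract's claim that a KV cache reduces redundant computation can be illustrated with a minimal sketch. It is not taken from the survey: the single-head attention, the helper names (`project`, `attend`, `decode_cached`, `decode_uncached`), and the toy weights are all assumptions chosen for illustration. The sketch decodes a short sequence twice, once re-projecting the keys and values of the whole prefix at every step and once appending them to a cache, and counts the key/value projections each variant performs.

```python
import math

def project(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over a list of keys/values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def decode_cached(tokens, wq, wk, wv):
    """Causal decoding that stores each token's key/value exactly once."""
    keys, values, outputs, kv_projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, wk))    # only the new token is projected
        values.append(project(x, wv))
        kv_projections += 1
        outputs.append(attend(project(x, wq), keys, values))
    return outputs, kv_projections

def decode_uncached(tokens, wq, wk, wv):
    """Causal decoding that re-projects the whole prefix at every step."""
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, wk) for x in prefix]    # redundant recomputation
        values = [project(x, wv) for x in prefix]
        kv_projections += len(prefix)
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, kv_projections

# Hypothetical toy inputs: four 2-dimensional token embeddings and 2x2 weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.1], [-0.2, 0.3]]
wv = [[1.0, 2.0], [3.0, 4.0]]

cached_out, cached_cost = decode_cached(tokens, wq, wk, wv)
naive_out, naive_cost = decode_uncached(tokens, wq, wk, wv)
print(cached_cost, naive_cost)
```

For these four tokens the cached variant performs 4 key/value projections and the uncached variant 10, while both produce the same attention outputs. The trade-off is that the cache grows by one key/value pair per token, which is the memory pressure that the selection, merging, quantization, and low-rank strategies named in the abstract aim to relieve.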

arXiv information

Authors	Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen
Published	2025-01-02 03:40:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
