Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs

Summary

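Multi-head Latent Attention (MLA) is an architecture proposed by DeepSeek that makes inference efficient and economical by compressing the Key-Value (KV) cache into a latent vector.
Compared with MLA, standard LLMs that use Multi-Head Attention (MHA), and variants such as Grouped-Query Attention (GQA), are at a significant cost disadvantage.
Enabling well-trained LLMs (e.g., Llama) to adapt quickly to MLA without pre-training from scratch is both meaningful and challenging.
This paper proposes MHA2MLA, the first data-efficient fine-tuning method for moving from MHA to MLA. It has two key components. For partial-RoPE, RoPE is removed from the query and key dimensions that contribute less to the attention scores.
For low-rank approximation, a joint SVD approximation is introduced, based on the pre-trained parameters of the keys and values.
With these strategies, MHA2MLA recovers performance using only a small fraction of the data (0.3% to 0.6%), significantly reducing inference cost while integrating seamlessly with compression techniques such as KV cache quantization.
For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.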

Abstract (original)

Multi-head Latent Attention (MLA) is an innovative architecture proposed by DeepSeek, designed to ensure efficient and economical inference by significantly compressing the Key-Value (KV) cache into a latent vector. Compared to MLA, standard LLMs employing Multi-Head Attention (MHA) and its variants such as Grouped-Query Attention (GQA) exhibit significant cost disadvantages. Enabling well-trained LLMs (e.g., Llama) to rapidly adapt to MLA without pre-training from scratch is both meaningful and challenging. This paper proposes the first data-efficient fine-tuning method for transitioning from MHA to MLA (MHA2MLA), which includes two key components: for partial-RoPE, we remove RoPE from dimensions of queries and keys that contribute less to the attention scores, for low-rank approximation, we introduce joint SVD approximations based on the pre-trained parameters of keys and values. These carefully designed strategies enable MHA2MLA to recover performance using only a small fraction (0.3% to 0.6%) of the data, significantly reducing inference costs while seamlessly integrating with compression techniques such as KV cache quantization. For example, the KV cache size of Llama2-7B is reduced by 92.19%, with only a 0.5% drop in LongBench performance.
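
The low-rank component can be illustrated with a small numerical sketch. This is only one plausible reading of "joint SVD" (concatenating the key and value projection matrices and truncating a single SVD), not the authors' implementation: the function name `joint_svd_low_rank`, the toy dimensions, and the random weights are all assumptions, and the sketch ignores the multi-head layout and the partial-RoPE split described in the abstract.

```python
import numpy as np

def joint_svd_low_rank(w_k, w_v, rank):
    """Hypothetical helper: factor [W_k, W_v] into one shared
    down-projection and separate key/value up-projections."""
    d_k = w_k.shape[1]
    # Stack key and value projections side by side: (d_model, d_k + d_v).
    w_kv = np.concatenate([w_k, w_v], axis=1)
    u, s, vt = np.linalg.svd(w_kv, full_matrices=False)
    # Keep the top-`rank` singular directions (best rank-r fit in Frobenius norm).
    w_down = u[:, :rank] * s[:rank]   # (d_model, rank), produces the latent
    w_up = vt[:rank, :]               # (rank, d_k + d_v)
    return w_down, w_up[:, :d_k], w_up[:, d_k:]

# Toy sizes and random stand-ins for pre-trained weights (assumptions).
rng = np.random.default_rng(0)
d_model, d_k, d_v, rank = 64, 32, 32, 16
w_k = rng.standard_normal((d_model, d_k))
w_v = rng.standard_normal((d_model, d_v))

w_down, w_up_k, w_up_v = joint_svd_low_rank(w_k, w_v, rank)

x = rng.standard_normal((5, d_model))  # hidden states for 5 tokens
latent = x @ w_down                    # (5, rank): cached instead of K and V
k_approx = latent @ w_up_k             # keys rebuilt from the latent
v_approx = latent @ w_up_v             # values rebuilt from the latent
```

In this toy setting each token caches a latent of width 16 instead of keys and values of combined width 64; the figures reported in the abstract come from the paper's own configuration, not from this sketch.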

arXiv information

Authors	Tao Ji, Bin Guo, Yuanbin Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
Published	2025-02-20 18:50:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
