TransMLA: Multi-Head Latent Attention Is All You Need

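Summary

Modern large language models (LLMs) are often limited by communication bottlenecks on current hardware rather than by compute alone.
Multi-head Latent Attention (MLA) addresses this by using low-rank matrices in the key-value (KV) layers, so that compressed latent KV states can be cached.
Compared with conventional multi-head attention, this greatly reduces the KV cache size and leads to faster inference.
MLA also applies an up-projection matrix to increase expressiveness, trading extra computation for less communication.
Although MLA has demonstrated its efficiency and effectiveness in DeepSeek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced plans to adopt MLA.
The paper shows that GQA can always be represented by MLA with the same KV cache overhead, while the converse does not hold.
To encourage wider use of MLA, the authors introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models.
After conversion, the model can be trained further to gain expressiveness without increasing the KV cache size.
The authors also plan to develop MLA-specific inference acceleration techniques to keep latency low in converted models, enabling more efficient distillation of DeepSeek R1.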

Summary (original)

Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.
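The claim that GQA can be represented by MLA at the same KV cache size can be illustrated with a small NumPy sketch. It is not the authors' conversion procedure. It only shows, under assumed toy dimensions, that repeating each cached key head for its group of query heads is the same as multiplying the cached keys by a fixed up-projection matrix. All names and sizes (`W_k`, `W_up`, `head_copy`, `n_kv_heads`, and so on) are hypothetical and chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes (assumptions, not taken from the paper).
seq_len, d_model = 5, 32
n_q_heads, n_kv_heads, d_head = 8, 2, 4
group = n_q_heads // n_kv_heads  # query heads that share one KV head

X = rng.standard_normal((seq_len, d_model))
W_k = rng.standard_normal((d_model, n_kv_heads * d_head))  # GQA key projection

# GQA view: cache K_small, then repeat each KV head for its query group.
K_small = X @ W_k  # shape (seq_len, n_kv_heads * d_head); this is what gets cached
K_gqa = np.repeat(
    K_small.reshape(seq_len, n_kv_heads, d_head), group, axis=1
).reshape(seq_len, n_q_heads * d_head)

# MLA view: treat K_small as the cached latent and write the repetition
# as a fixed up-projection built from identity blocks.
head_copy = np.kron(np.eye(n_kv_heads), np.ones((1, group)))  # (n_kv_heads, n_q_heads)
W_up = np.kron(head_copy, np.eye(d_head))  # (n_kv_heads * d_head, n_q_heads * d_head)
K_mla = K_small @ W_up

# Both views produce the same keys from the same cached tensor.
print(np.allclose(K_gqa, K_mla))
print(K_small.shape)
```

In this sketch `W_up` is restricted to a block-replication pattern. A general up-projection of the same shape need not have that structure, which is one way to read the abstract's statement that the converse does not hold.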

arXiv information

Authors	Fanxu Meng, Zengwei Yao, Muhan Zhang
Published	2025-02-13 18:07:04+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
