Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization

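Summary

Transformer models face scalability challenges in causal language modeling (CLM) because memory is allocated inefficiently to the growing key-value (KV) cache, which strains compute and storage resources.
Existing methods such as Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation: they often discard "low-priority" tokens or group them statically, and so fail to address the dynamic spectrum of token importance.
We propose mixSGA, a novel mixture-of-experts (MoE) approach that dynamically optimizes token-wise computation and memory allocation.
Unlike prior approaches, mixSGA retains all tokens and adaptively routes them to specialized experts with varying KV group sizes, balancing granularity and efficiency.
The key novelties are: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, which enables proportional resource allocation without discarding tokens;
(2) weight sharing across grouped attention projections to minimize parameter overhead; and
(3) an auxiliary loss that encourages one-hot routing decisions for training-inference consistency in CLMs.
Extensive evaluations across the Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA's superiority over static baselines.
On instruction-following and continued-pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.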
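
As a rough illustration of routing without token discard, the sketch below sorts tokens by an importance score and hands fixed shares of them to experts with different KV group sizes. It is not the authors' implementation. The function names, the group sizes (1, 2, 4), the token shares, and the assumption that an expert with group size g stores 1/g of the ungrouped KV entries are all hypothetical choices made for this example.

```python
def route_by_importance(scores, capacities):
    """Assign every token to exactly one expert (illustrative sketch only).

    scores: one importance score per token (higher = more important).
    capacities: assumed fraction of tokens each expert takes, ordered from
        the finest-grained expert (smallest KV group size) to the coarsest.
    Returns a list holding the expert index chosen for each token.
    """
    n = len(scores)
    # Token indices from most to least important.
    order = sorted(range(n), key=lambda t: scores[t], reverse=True)
    assignment = [None] * n
    start = 0
    for expert, frac in enumerate(capacities):
        if expert == len(capacities) - 1:
            end = n  # the last expert absorbs the remainder, so nothing is dropped
        else:
            end = min(n, start + round(frac * n))
        for t in order[start:end]:
            assignment[t] = expert
        start = end
    return assignment


def relative_kv_size(assignment, group_sizes):
    """KV cache size relative to storing ungrouped KV heads for every token.

    Assumes a token routed to an expert with group size g keeps 1/g of the
    KV entries it would keep without grouping.
    """
    return sum(1.0 / group_sizes[e] for e in assignment) / len(assignment)


scores = [0.9, 0.1, 0.5, 0.7, 0.2, 0.3, 0.8, 0.4]  # made-up importance scores
group_sizes = [1, 2, 4]                            # hypothetical experts
capacities = [0.25, 0.25, 0.5]                     # hypothetical token shares

assignment = route_by_importance(scores, capacities)
print(assignment)                                 # [0, 2, 1, 1, 2, 2, 0, 2]
print(relative_kv_size(assignment, group_sizes))  # 0.5
```

With these made-up shares, every token stays in the cache, yet the total KV footprint equals that of a uniform group size of 2. The two highest-scoring tokens keep full-resolution KV entries and the four lowest-scoring ones share the coarsest grouping.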

Summary (original)

Transformer models face scalability challenges in causal language modeling (CLM) due to inefficient memory allocation for growing key-value (KV) caches, which strains compute and storage resources. Existing methods like Grouped Query Attention (GQA) and token-level KV optimization improve efficiency but rely on rigid resource allocation, often discarding ‘low-priority’ tokens or statically grouping them, failing to address the dynamic spectrum of token importance. We propose mixSGA, a novel mixture-of-expert (MoE) approach that dynamically optimizes token-wise computation and memory allocation. Unlike prior approaches, mixSGA retains all tokens while adaptively routing them to specialized experts with varying KV group sizes, balancing granularity and efficiency. Our key novelties include: (1) a token-wise expert-choice routing mechanism guided by learned importance scores, enabling proportional resource allocation without token discard; (2) weight-sharing across grouped attention projections to minimize parameter overhead; and (3) an auxiliary loss to ensure one-hot routing decisions for training-inference consistency in CLMs. Extensive evaluations across Llama3, TinyLlama, OPT, and Gemma2 model families show mixSGA’s superiority over static baselines. On instruction-following and continued pretraining tasks, mixSGA achieves higher ROUGE-L and lower perplexity under the same KV budgets.
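
The third novelty, an auxiliary loss that pushes routing toward one-hot decisions, can be pictured with a generic penalty. The abstract does not specify the loss, so the sketch below is a stand-in, not the paper's formulation. It uses 1 - sum(p_i^2), which is zero only when the routing distribution is one-hot. The function names and example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_penalty(probs):
    """Stand-in auxiliary loss: 1 - sum(p^2).

    Equals 0 for a one-hot distribution and is largest for a uniform one.
    """
    return 1.0 - sum(p * p for p in probs)


def hard_route(probs):
    """Inference-style decision: pick the single most probable expert."""
    return max(range(len(probs)), key=lambda i: probs[i])


soft = softmax([1.0, 0.0, 0.0])   # undecided router: experts are blended
sharp = softmax([8.0, 0.0, 0.0])  # nearly one-hot router

# The penalty shrinks as the distribution approaches one-hot.
print(one_hot_penalty(soft) > one_hot_penalty(sharp))  # True
print(hard_route(soft), hard_route(sharp))             # 0 0
```

The motivation, as read from the abstract, is that training can blend experts with soft weights while inference commits to a single expert per token. When the soft weights are already close to one-hot, the blended output and the hard choice nearly coincide, so the model sees consistent behavior in both settings.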

arXiv information

Authors	Guanghui Song,Dongping Liao,Yiren Zhao,Kejiang Ye,Cheng-zhong Xu,Xitong Gao
Published	2025-06-16 14:30:17+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
