Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations

Abstract

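Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts.
However, the memory overhead of storing all experts remains a major limitation, especially for large-scale MoE models such as DeepSeek-R1 (671B).
In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior that we call few-shot expert localization: given only a few demonstrations, the model consistently activates a sparse and stable subset of experts.
Building on this observation, we propose EASY-EP, a simple yet effective pruning framework that uses a few domain-specific demonstrations to identify and retain only the most relevant experts.
EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation.
The former evaluates the importance of each expert for the current token by considering the gating scores and the magnitudes of the activated experts' outputs; the latter assesses the contribution of each token based on the similarity of its representations before and after the routed experts.
Experiments show that, with only half of the experts, our method achieves performance comparable to the full DeepSeek-R1 and $2.99\times$ throughput under the same memory budget.
Our code is available at https://github.com/RUCAIBox/EASYEP.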
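
The two components named in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: expert importance is taken as the gating score times the L2 norm of the expert's output, token contribution is taken as one minus the cosine similarity between the representations before and after the routed experts, and the function names and toy data are hypothetical. It is not the authors' implementation; see the linked repository for that.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a plain list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = l2_norm(a) * l2_norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom > 0 else 0.0


def expert_importance(gates, outputs):
    """ASSUMED form: gating score times output magnitude, per activated expert."""
    return {e: gates[e] * l2_norm(outputs[e]) for e in gates}


def token_contribution(before, after):
    """ASSUMED form: tokens whose representation changes more get more weight."""
    return 1.0 - cosine_similarity(before, after)


def score_experts(tokens, num_experts):
    """Accumulate token-weighted importance over a few demonstration tokens."""
    scores = [0.0] * num_experts
    for tok in tokens:
        weight = token_contribution(tok["before"], tok["after"])
        for e, imp in expert_importance(tok["gates"], tok["outputs"]).items():
            scores[e] += weight * imp
    return scores


def select_experts(scores, keep):
    """Return the ids of the `keep` highest-scoring experts, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:keep])


# Toy demonstration data (hypothetical): one layer, 4 experts, 2 tokens.
tokens = [
    {"gates": {0: 0.7, 2: 0.3},
     "outputs": {0: [3.0, 4.0], 2: [1.0, 0.0]},
     "before": [1.0, 0.0], "after": [0.0, 1.0]},
    {"gates": {0: 0.6, 1: 0.4},
     "outputs": {0: [0.0, 1.0], 1: [0.0, 2.0]},
     "before": [1.0, 0.0], "after": [1.0, 0.0]},
]

scores = score_experts(tokens, num_experts=4)
kept = select_experts(scores, keep=2)  # keep half of the experts
print(scores, kept)
```

In this toy setup the second token's representation is unchanged by the routed experts, so under the assumed weighting it contributes nothing to any expert's score. A real model would apply such scoring per MoE layer over the hidden states of the demonstration prompts.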

Abstract (original)

Mixture-of-Experts (MoE) models achieve a favorable trade-off between performance and inference efficiency by activating only a subset of experts. However, the memory overhead of storing all experts remains a major limitation, especially in large-scale MoE models such as DeepSeek-R1 (671B). In this study, we investigate domain specialization and expert redundancy in large-scale MoE models and uncover a consistent behavior we term few-shot expert localization, with only a few demonstrations, the model consistently activates a sparse and stable subset of experts. Building on this observation, we propose a simple yet effective pruning framework, EASY-EP, that leverages a few domain-specific demonstrations to identify and retain only the most relevant experts. EASY-EP comprises two key components: output-aware expert importance assessment and expert-level token contribution estimation. The former evaluates the importance of each expert for the current token by considering the gating scores and magnitudes of the outputs of activated experts, while the latter assesses the contribution of tokens based on representation similarities after and before routed experts. Experiments show that our method can achieve comparable performances and $2.99\times$ throughput under the same memory budget with full DeepSeek-R1 with only half the experts. Our code is available at https://github.com/RUCAIBox/EASYEP.

arXiv information

Authors	Zican Dong, Han Peng, Peiyu Liu, Wayne Xin Zhao, Dong Wu, Feng Xiao, Zhifeng Wang
Published	2025-04-09 11:34:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
