Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Summary

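Mixture of Experts (MoE) is an effective architecture for scaling large language models: sparse expert activation lets it optimize the trade-off between performance and efficiency.
Under expert parallelism, however, MoE inference is inefficient because token-to-expert assignment is imbalanced: some experts are overloaded while others remain underutilized.
This imbalance lowers resource utilization and increases latency, because the most heavily loaded expert determines the overall delay, a phenomenon we define as the Straggler Effect.
To mitigate it, we propose Capacity-Aware Inference, which comprises two key techniques: (1) Capacity-Aware Token Drop, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) Capacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts, balancing the token distribution.
Together, these techniques optimize the utilization of both high-load and low-load experts, leading to a more efficient MoE inference pipeline.
Extensive experiments demonstrate the effectiveness of our methods, e.g., a 0.2% average performance increase and a 1.94× inference speedup on Mixtral-8×7B-Instruct.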

Summary (original)

The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the \textbf{\textit{Straggler Effect}}. To mitigate this, we propose Capacity-Aware Inference, including two key techniques: (1) \textbf{\textit{Capacity-Aware Token Drop}}, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) \textbf{\textit{Capacity-Aware Token Reroute}}, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., 0.2\% average performance increase and a 1.94$\times$ inference speedup on Mixtral-8$\times$7B-Instruct.
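
The two techniques named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: top-1 routing, a per-expert capacity of `ceil(capacity_factor * num_tokens / num_experts)`, overflow decided in token arrival order, and rerouting to the next-best expert (by router score) that still has room. The function name `route_with_capacity` and the toy scores are hypothetical.

```python
import math


def route_with_capacity(scores, num_experts, capacity_factor=1.0, reroute=False):
    """Top-1 routing with a per-expert capacity cap (illustrative only).

    scores[t][e] is a hypothetical router score of token t for expert e.
    Returns (assignment, loads); assignment[t] is an expert index,
    or None if the token was dropped.
    """
    num_tokens = len(scores)
    # Assumed capacity rule: a multiple of the perfectly balanced load.
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    loads = [0] * num_experts
    assignment = [None] * num_tokens
    for t, row in enumerate(scores):
        # Experts ordered from highest to lowest router score for this token.
        ranked = sorted(range(num_experts), key=lambda e: row[e], reverse=True)
        # Token drop: only the top choice is tried.
        # Token reroute: fall back to lower-ranked experts with spare room.
        candidates = ranked if reroute else ranked[:1]
        for e in candidates:
            if loads[e] < capacity:
                assignment[t] = e
                loads[e] += 1
                break
    return assignment, loads


# Toy batch (made-up scores): 6 of 8 tokens prefer expert 0.
scores = [[0.7, 0.2, 0.1, 0.0]] * 6 + [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]

# capacity_factor=4.0 gives a cap of 8, which never binds: the unconstrained case.
_, base_loads = route_with_capacity(scores, 4, capacity_factor=4.0)
drop_assign, drop_loads = route_with_capacity(scores, 4, capacity_factor=1.0)
rer_assign, rer_loads = route_with_capacity(scores, 4, capacity_factor=1.0, reroute=True)

for name, loads in [("baseline", base_loads), ("drop", drop_loads), ("reroute", rer_loads)]:
    # Under expert parallelism the busiest expert sets the step time.
    print(name, loads, "max load =", max(loads))
```

In this toy batch the busiest expert handles 6 tokens without a cap and 2 tokens with one. Dropping discards the four overflowed tokens, while rerouting keeps all eight by filling the idle experts. Because the sketch is greedy in arrival order, later tokens can be displaced from their preferred expert by earlier rerouted ones; a real system would need a more careful policy.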

arXiv information

Authors	Shwai He, Weilin Cai, Jiayi Huang, Ang Li
Published	2025-05-22 17:55:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
