EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference

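Abstract

Mixture-of-Experts (MoE) models have drawn attention in the field of Large Language Models (LLMs) as an architecture that balances model performance against computational efficiency. However, their General Matrix Multiply (GEMM) operations and large parameter counts raise computational-efficiency and communication-overhead problems that bottleneck inference throughput. Applying a single parallelism strategy such as EP, DP, or TP, or a simple combination of them, to MoE usually yields sub-optimal inference throughput. This paper presents EPS-MoE, a new expert pipeline scheduler for MoE that outperforms existing parallelism schemes. The approach optimizes the computation of the MoE FeedForward Network (FFN) module by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with communication, which substantially raises throughput. Experiments show prefill throughput improved by up to 52.4% over existing parallel inference methods. In particular, the method speeds up the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.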

Abstract (original)

The Mixture-of-Experts (MoE) model has emerged as a prominent architecture in the field of Large Language Models (LLMs), providing a better balance between model performance and computational efficiency. However, the General Matrix Multiply (GEMM) operations and large parameters introduce challenges related to computational efficiency and communication overhead, which become throughput bottlenecks during inference. Applying a single parallelism strategy like EP, DP, TP or a straightforward combination of them to MoE usually achieves sub-optimal inference throughput. This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our approach optimizes the computation of MoE FeedForward Network (FFN) modules by dynamically selecting the best kernel implementation of GroupGemm and DenseGemm for different loads and adaptively overlapping these computations with communication, leading to a substantial increase in throughput. Our experimental results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods. Specifically, our method accelerated the highly optimized DeepSeekV2 model from a claimed 100K tokens per second to at least 120K tokens per second.
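
The two throughput figures in the abstract refer to different baselines: the 52.4% figure is the best case against existing parallel inference methods, while the DeepSeekV2 figure is measured against a claimed 100K tokens per second. As a rough check of what the second claim implies, the short sketch below computes the relative gain from the two quoted numbers. The function and variable names are illustrative only and do not come from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Fractional throughput gain of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Figures quoted in the abstract, in tokens per second.
claimed_baseline = 100_000  # throughput claimed for DeepSeekV2
reported_floor = 120_000    # "at least" value reported for EPS-MoE

gain = relative_gain(claimed_baseline, reported_floor)
print(f"DeepSeekV2 gain: at least {gain:.1%}")
```

Because the abstract states "at least 120K", the resulting 20% is a lower bound on the DeepSeekV2 speedup, not the 52.4% maximum reported for the wider comparison.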

arXiv information

Authors	Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, Kefeng Zhang, Xunliang Cai
Published	2025-01-03 06:19:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
