Toward Inference-optimal Mixture-of-Expert Large Language Models

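Summary

Recent Mixture-of-Expert (MoE) based large language models (LLMs), such as Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering the quadratic growth in training cost of dense transformers. As with dense models, training an MoE requires answering the same question: given a training budget, what is the optimal allocation between model size and number of tokens? We study the scaling law of MoE-based LLMs with respect to model performance, model size, dataset size, and expert degree. Echoing previous work that studied MoE in other contexts, we observe diminishing returns from increasing the number of experts. Taken alone, this seems to suggest scaling the number of experts until saturation, since the training cost would remain constant, but doing so is problematic at inference time. We propose to amend the MoE scaling law by introducing inference efficiency as a second metric alongside validation loss. We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more to train. On the other hand, training a (16/32)-expert MoE that is much smaller (70-85%) than the loss-optimal solution, but on a larger training dataset, is a promising setup under a training budget.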

Abstract (original)

Mixture-of-Expert (MoE) based large language models (LLMs), such as the recent Mixtral and DeepSeek-MoE, have shown great promise in scaling model size without suffering from the quadratic growth of training cost of dense transformers. Like dense models, training MoEs requires answering the same question: given a training budget, what is the optimal allocation on the model size and number of tokens? We study the scaling law of MoE-based LLMs regarding the relations between the model performance, model size, dataset size, and the expert degree. Echoing previous research studying MoE in different contexts, we observe the diminishing return of increasing the number of experts, but this seems to suggest we should scale the number of experts until saturation, as the training cost would remain constant, which is problematic during inference time. We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss. We find that MoEs with a few (4/8) experts are the most serving efficient solution under the same performance, but costs 2.5-3.5x more in training. On the other hand, training a (16/32) expert MoE much smaller (70-85%) than the loss-optimal solution, but with a larger training dataset is a promising setup under a training budget.
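The budget trade-off in the abstract can be illustrated with a rough sketch. It assumes the common rule of thumb that training compute is about 6 FLOPs per active parameter per token (C ≈ 6·N·D), which the abstract does not state, and it reads "70-85%" as the fraction of the loss-optimal model size, which is only one possible reading. The function names and the example budget and model size are hypothetical.

```python
# Assumption (not from the abstract): training FLOPs C ~= 6 * N * D, where
# N is the number of parameters active per token and D is the token count.
FLOPS_PER_PARAM_TOKEN = 6.0


def training_tokens(budget_flops: float, active_params: float) -> float:
    """Tokens affordable under a fixed FLOP budget for a given model size."""
    if budget_flops <= 0 or active_params <= 0:
        raise ValueError("budget_flops and active_params must be positive")
    return budget_flops / (FLOPS_PER_PARAM_TOKEN * active_params)


def token_multiplier(size_fraction: float) -> float:
    """Growth in token count when the model shrinks to size_fraction of a
    reference size while the FLOP budget stays fixed."""
    if not 0.0 < size_fraction <= 1.0:
        raise ValueError("size_fraction must be in (0, 1]")
    return 1.0 / size_fraction


if __name__ == "__main__":
    budget = 1e21          # hypothetical training budget in FLOPs
    reference_size = 1e9   # hypothetical loss-optimal active parameter count
    for fraction in (1.0, 0.85, 0.70):
        tokens = training_tokens(budget, fraction * reference_size)
        print(f"size fraction {fraction:.2f}: {tokens:.3e} tokens, "
              f"{token_multiplier(fraction):.2f}x the reference token count")
```

Under these assumptions, shrinking the model at a fixed budget frees compute for proportionally more training tokens, which is the direction of the trade-off the abstract describes.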

arXiv information

Authors	Longfei Yun, Yonghao Zhuang, Yao Fu, Eric P Xing, Hao Zhang
Published	2024-04-03 16:33:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
