Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Summary

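Mixture-of-Experts (MoE) language models can cut computational cost by 2-4$\times$ relative to dense models without sacrificing performance, which makes them more efficient in compute-bound scenarios.
However, an MoE model generally needs 2-4$\times$ as many parameters as a dense model to reach comparable performance. This raises GPU memory requirements and makes MoE models less efficient in I/O-bound scenarios such as autoregressive generation.
This work proposes DS-MoE, a hybrid dense-training, sparse-inference framework for MoE models. It computes densely across all experts during training and sparsely during inference, achieving strong computational and parameter efficiency.
In LLM training experiments, DS-MoE models are more parameter-efficient than standard sparse MoEs and match dense models in total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters).
In performance tests using vLLM, the DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models such as Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.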

Abstract (original)

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4$\times$ more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to $1.86\times$ faster than similar dense models like Mistral-7B, and between $1.50\times$ and $1.71\times$ faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
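
The contrast between dense computation during training and sparse computation during inference can be illustrated with a toy MoE layer. The sketch below is not the paper's implementation. It assumes softmax routing, top-k expert selection at inference, and no renormalization of the selected gate weights, none of which the abstract specifies. The names `moe_forward`, `router`, `experts`, and `top_k` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def moe_forward(x, router, experts, top_k=None):
    """Gate-weighted sum of expert outputs for one token vector x.

    top_k=None -> dense mode: every expert is evaluated (training-style).
    top_k=k    -> sparse mode: only the k highest-gated experts are
                  evaluated (inference-style; an assumed selection rule).
    Returns (output vector, indices of the experts that were evaluated).
    """
    gates = softmax(matvec(router, x))
    if top_k is None:
        active = list(range(len(experts)))
    else:
        ranked = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)
        active = ranked[:top_k]
    out = [0.0] * len(x)
    for i in active:
        y = matvec(experts[i], x)  # only active experts cost compute
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, active


# Toy setup: 4 experts, 2-dimensional tokens (values are arbitrary).
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
experts = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.0, 1.0], [1.0, 0.0]],
    [[-1.0, 0.0], [0.0, -1.0]],
]
x = [2.0, 1.0]

dense_out, dense_active = moe_forward(x, router, experts)             # training-style
sparse_out, sparse_active = moe_forward(x, router, experts, top_k=2)  # inference-style
active_fraction = len(sparse_active) / len(experts)
print(dense_active, sparse_active, active_fraction)
```

In this toy setup the sparse pass evaluates half of the experts, while all expert weights still have to be held in memory. That is the trade-off the abstract describes: per-token compute scales with the active experts, while memory and I/O scale with the total parameter count.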

arXiv information

Authors	Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, Rameswar Panda
Published	2024-04-08 14:39:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
