MaskMoE: Boosting Token-Level Learning via Routing Mask in Mixture-of-Experts

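Summary

Scaling up a model improves its capabilities but significantly increases computational complexity.
Mixture-of-Experts (MoE) models address this by allowing model size to grow without substantially increasing training or inference cost.
An MoE contains a key module called the router, which assigns each token to experts.
The mainstream routing methods today are dynamic routing and fixed routing.
Despite promising results, MoE models face several challenges.
First, with dynamic routing, spreading training tokens across multiple experts can cause underfitting, especially for infrequent tokens.
Fixed routing can mitigate that problem, but at the cost of representation diversity.
This paper proposes MaskMoE, a method designed to strengthen token-level learning by applying a routing masking technique within the Mixture-of-Experts model.
MaskMoE can maintain representation diversity while achieving more comprehensive training.
Experimental results show that the method outperforms previous dominant Mixture-of-Experts models in both perplexity (PPL) and downstream task performance.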

Abstract (original)

Scaling the size of a model enhances its capabilities but significantly increases computation complexity. Mixture-of-Experts models (MoE) address the issue by allowing model size to scale up without substantially increasing training or inference costs. In MoE, there is an important module called the router, which is used to distribute each token to the experts. Currently, the mainstream routing methods include dynamic routing and fixed routing. Despite their promising results, MoE models encounter several challenges. Primarily, for dynamic routing methods, the dispersion of training tokens across multiple experts can lead to underfitting, particularly for infrequent tokens. Additionally, though fixed routing methods can mitigate that issue, they compromise on the diversity of representations. In this paper, we propose MaskMoE, a method designed to enhance token-level learning by employing a routing masking technique within the Mixture-of-Experts model. MaskMoE is capable of maintaining representation diversity while achieving more comprehensive training. Experimental results demonstrate that our method outperforms previous dominant Mixture-of-Experts models in terms of both perplexity (PPL) and downstream task performance.
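
The abstract does not spell out how the routing mask is built, so the following is only a rough sketch of what masking a router's choices could look like, not the authors' implementation. It assumes top-1 routing, a fixed per-token-ID mask over experts, and a hypothetical policy (`visible_per_token`) under which infrequent token IDs may reach a single expert while frequent ones may reach all of them. All function names, the policy, and the toy sizes are illustrative assumptions.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def build_routing_mask(vocab_size, num_experts, visible_per_token, seed=0):
    """Hypothetical helper: fix, per token ID, the subset of experts it may reach.

    visible_per_token(token_id) returns how many experts that token can see
    (an assumed policy; it must return at least 1).
    """
    rng = random.Random(seed)
    mask = []
    for token_id in range(vocab_size):
        k = visible_per_token(token_id)
        allowed = set(rng.sample(range(num_experts), k))
        mask.append([e in allowed for e in range(num_experts)])
    return mask


def route_top1(logits, allowed):
    """Mask disallowed experts with -inf, then pick the most probable expert."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, allowed)]
    probs = softmax(masked)
    expert = max(range(len(probs)), key=probs.__getitem__)
    return expert, probs


# Toy setup (assumed): token IDs 0-1 are "frequent", 2-3 are "infrequent".
mask = build_routing_mask(
    vocab_size=4,
    num_experts=4,
    visible_per_token=lambda t: 4 if t < 2 else 1,
)
router_logits = [0.1, 2.0, -1.0, 0.5]  # stand-in for a learned router's output

frequent_expert, frequent_probs = route_top1(router_logits, mask[0])
rare_expert, rare_probs = route_top1(router_logits, mask[3])
```

In this sketch a frequent token is routed as in ordinary dynamic routing, while an infrequent token always lands on the same expert regardless of the router logits, which is the fixed-routing behavior the abstract says reduces underfitting.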

arXiv information

Authors	Zhenpeng Su, Zijia Lin, Xue Bai, Xing Wu, Yizhe Xiong, Haoran Lian, Guangyuan Ma, Hui Chen, Guiguang Ding, Wei Zhou, Songlin Hu
Published	2024-08-29 08:45:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
