Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

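Summary

The Mixture of Experts (MoE) architecture significantly reduces training and inference costs compared with a dense model of equivalent capacity.
Upcycling is an approach that initializes and trains an MoE model from a pre-trained dense model.
Although upcycling yields initial performance gains, training then progresses more slowly than training from scratch, which leads to suboptimal performance in the long term.
We propose Drop-Upcycling, a method that effectively addresses this problem.
Drop-Upcycling combines two seemingly contradictory approaches: it uses the knowledge of a pre-trained dense model while statistically re-initializing part of the weights.
This strategically promotes expert specialization and significantly improves the MoE model's efficiency in acquiring knowledge.
Extensive large-scale experiments show that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more.
As a result, an MoE model with 5.9B active parameters achieves performance comparable to a 13B dense model in the same model family while requiring approximately 1/4 of the training FLOPs.
All experimental resources, including source code, training data, model checkpoints, and logs, are publicly available to promote reproducibility and future research on MoE.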

Summary (original)

The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling – a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model’s efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.
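
The idea of keeping dense-model knowledge while statistically re-initializing part of the weights can be illustrated with a minimal sketch. The abstract does not specify the procedure, so everything here is an assumption made for illustration: the function name `drop_upcycle_ffn`, the `ratio` parameter, the two-matrix feed-forward layout, and the choice to redraw the selected hidden units from a normal distribution whose mean and standard deviation match the dense weights. It should not be read as the authors' implementation.

```python
import copy
import random
import statistics


def drop_upcycle_ffn(w_up, w_down, num_experts, ratio, seed=0):
    """Hypothetical helper: copy a dense FFN into experts, then partially reset each copy.

    w_up:   d_model x d_ff matrix, stored as a list of rows
    w_down: d_ff x d_model matrix, stored as a list of rows
    ratio:  assumed fraction of the d_ff hidden units to re-initialize per expert
    """
    rng = random.Random(seed)
    d_ff = len(w_down)

    # Statistics of the dense weights; replacement values are drawn to match them
    # (an assumed reading of "statistically re-initializing").
    up_flat = [x for row in w_up for x in row]
    down_flat = [x for row in w_down for x in row]
    up_mu, up_sigma = statistics.fmean(up_flat), statistics.pstdev(up_flat)
    down_mu, down_sigma = statistics.fmean(down_flat), statistics.pstdev(down_flat)

    experts = []
    for _ in range(num_experts):
        # Start every expert as an exact copy of the dense layer (plain upcycling).
        e_up, e_down = copy.deepcopy(w_up), copy.deepcopy(w_down)
        # Each expert gets its own random subset of hidden units to reset.
        dropped = sorted(rng.sample(range(d_ff), round(ratio * d_ff)))
        for j in dropped:
            for row in e_up:  # column j of the up-projection
                row[j] = rng.gauss(up_mu, up_sigma)
            # row j of the down-projection
            e_down[j] = [rng.gauss(down_mu, down_sigma) for _ in e_down[j]]
        experts.append({"w_up": e_up, "w_down": e_down, "dropped": dropped})
    return experts


# Toy usage: a tiny dense layer turned into 4 experts, half of each re-initialized.
_rng = random.Random(1)
dense_up = [[_rng.gauss(0.0, 0.02) for _ in range(8)] for _ in range(4)]
dense_down = [[_rng.gauss(0.0, 0.02) for _ in range(4)] for _ in range(8)]
experts = drop_upcycle_ffn(dense_up, dense_down, num_experts=4, ratio=0.5)
```

In this sketch, `ratio=0.0` reduces to plain upcycling (every expert is an identical copy of the dense layer), while `ratio=1.0` discards the dense weights entirely. Because each expert resets a different random subset, the experts start out partly shared and partly distinct, which is one plausible way to read the abstract's claim about promoting expert specialization.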

arXiv information

Authors	Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki
Published	2025-02-26 16:06:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
