Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Summary

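Scaling the capacity of language models has consistently proven to be a reliable way to improve performance and unlock new capabilities.
Capacity can be defined primarily along two dimensions: the number of model parameters and the compute per example.
Scaling typically increases both, but the precise interplay between these factors and their combined contribution to overall capacity is not yet fully understood.
We explore this relationship in the context of sparse Mixture-of-Experts (MoE) models, which allow the number of parameters to be scaled without proportionally increasing the FLOPs per example.
We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in both pretraining and downstream evaluation.
We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance.
These results provide a better understanding of the impact of sparsity in scaling laws for MoEs, complement existing work in this area, and offer insights for designing more efficient architectures.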
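
The sparsity level described above (the ratio of non-active to total parameters) can be illustrated with a small calculation. The following is a minimal sketch, not the authors' code: the function name, its arguments, and the assumptions that every expert has the same size and that exactly `top_k` experts are active per token are all hypothetical choices made for illustration.

```python
def moe_sparsity(shared_params: int, params_per_expert: int,
                 num_experts: int, top_k: int) -> float:
    """Return non-active / total parameters for a simplified MoE model.

    Assumptions (hypothetical, for illustration only):
      - shared_params are always active (e.g. attention, embeddings),
      - all experts have the same size,
      - exactly top_k experts are active for each token.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return (total - active) / total


# Expert-only view: 8 experts with 2 active leaves 6 of 8 inactive.
print(moe_sparsity(0, 1_000_000, 8, 2))

# Always-active shared parameters lower the overall sparsity.
print(moe_sparsity(2_000_000, 1_000_000, 8, 2))
```

Under these assumptions, adding experts while holding `top_k` fixed raises the total parameter count and the sparsity, while the active parameter count, and hence roughly the FLOPs per example, stays the same.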

Summary (original)

Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be primarily defined by two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity remains not fully understood. We explore this relationship in the context of sparse Mixture-of-Expert models (MoEs), which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the ratio of non-active to total parameters, affects model performance in terms of both pretraining and downstream performance. We find that under different constraints (e.g. parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing works in this area, offering insights for designing more efficient architectures.

arXiv information

Authors	Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, Vimal Thilak
Published	2025-01-21 18:51:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
