Predictable Scale: Part I — Optimal Hyperparameter Scaling Law in Large Language Model Pretraining

Summary

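The impressive capabilities of large language models (LLMs) across diverse tasks are now well established, but deploying them effectively requires careful hyperparameter optimization.
Through extensive empirical studies, including grid searches across diverse configurations, we discover universal scaling laws that govern these hyperparameters: the optimal learning rate follows a power-law relationship with both the number of model parameters and the data size, while the optimal batch size scales primarily with the data size.
Our analysis reveals a convex optimization landscape for the hyperparameters when model size and data size are held fixed.
This convexity implies an optimal hyperparameter plateau.
We contribute a universal, plug-and-play tool for choosing optimal hyperparameters to the community.
On the test set, its estimates are only 0.07% away from the globally optimal LLM performance found by exhaustive search.
These laws remain remarkably robust across variations in model sparsity, training data distribution, and model shape.
To the best of our knowledge, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, and that establishes optimal hyperparameter scaling laws across diverse data distributions.
This exhaustive optimization process demands substantial computational resources: nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch, consuming approximately 100 trillion tokens in total.
To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository, https://step-law.github.io/.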
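
As a rough illustration of the power-law form described in the summary, the sketch below evaluates a generic power law in one or more sizes and recovers a one-variable law from data by least squares in log-log space. The function names and the numeric constants are hypothetical placeholders chosen for the demonstration; they are not the coefficients fitted in the paper, and the fitting routine is a standard log-log regression, not necessarily the authors' procedure.

```python
import math


def power_law(coef, exponents, sizes):
    """Return coef * prod(size ** exponent) over the paired sizes and exponents."""
    value = coef
    for size, exponent in zip(sizes, exponents):
        value *= size ** exponent
    return value


def fit_power_law_1d(xs, ys):
    """Fit y = c * x**a by ordinary least squares on (log x, log y); return (c, a)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    a = sxy / sxx                      # slope in log-log space is the exponent
    c = math.exp(mean_y - a * mean_x)  # intercept in log-log space is log(c)
    return c, a


# Hypothetical placeholder constants, NOT the values reported in the paper.
HYPOTHETICAL_COEF = 0.5
HYPOTHETICAL_EXPONENT = 0.5

# Synthetic "optimal batch size vs. data size" points generated from the placeholder law.
token_counts = [1e9, 1e10, 1e11, 1e12]
synthetic_batch_sizes = [
    power_law(HYPOTHETICAL_COEF, [HYPOTHETICAL_EXPONENT], [d]) for d in token_counts
]

c_hat, a_hat = fit_power_law_1d(token_counts, synthetic_batch_sizes)
print(f"recovered coefficient={c_hat:.3f}, exponent={a_hat:.3f}")
```

A learning-rate law that depends on both model size and data size has the same shape with two exponents, for example `power_law(c, [alpha, beta], [N, D])`, where `c`, `alpha`, and `beta` would have to come from a fit such as the one the paper reports.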

Abstract (original)

The impressive capabilities of Large Language Models (LLMs) across diverse tasks are now well-established, yet their effective deployment necessitates careful hyperparameter optimization. Through extensive empirical studies involving grid searches across diverse configurations, we discover universal scaling laws governing these hyperparameters: optimal learning rate follows a power-law relationship with both model parameters and data sizes, while optimal batch size scales primarily with data sizes. Our analysis reveals a convex optimization landscape for hyperparameters under fixed models and data size conditions. This convexity implies an optimal hyperparameter plateau. We contribute a universal, plug-and-play optimal hyperparameter tool for the community. Its estimated values on the test set are merely 0.07% away from the globally optimal LLM performance found via an exhaustive search. These laws demonstrate remarkable robustness across variations in model sparsity, training data distribution, and model shape. To our best known, this is the first work that unifies different model shapes and structures, such as Mixture-of-Experts models and dense transformers, as well as establishes optimal hyperparameter scaling laws across diverse data distributions. This exhaustive optimization process demands substantial computational resources, utilizing nearly one million NVIDIA H800 GPU hours to train 3,700 LLMs of varying sizes and hyperparameters from scratch and consuming approximately 100 trillion tokens in total. To facilitate reproducibility and further research, we will progressively release all loss measurements and model checkpoints through our designated repository https://step-law.github.io/

arXiv information

Authors	Houyi Li, Wenzheng Zheng, Jingcheng Hu, Qiufeng Wang, Hanshan Zhang, Zili Wang, Yangshijie Xu, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang
Published	2025-03-06 18:58:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
