RegMix: Data Mixture as Regression for Language Model Pre-training

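Abstract

The data mixture used for large language model pre-training significantly affects performance, yet how to determine an effective mixture remains unclear.
We propose RegMix, which automatically identifies a high-performing data mixture by formulating the problem as a regression task.
RegMix trains many small models on diverse data mixtures, uses regression to predict the performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute.
To validate RegMix empirically, we train 512 models with 1M parameters on 1B tokens each to fit the regression model and predict the best data mixture.
Using this mixture, we train a 1B-parameter model on 25B tokens (i.e., 1000x larger and 25x longer), which performs best among 64 candidate 1B-parameter models trained with other mixtures.
Furthermore, RegMix consistently outperforms human selection in experiments involving models of up to 7B parameters trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources.
Our experiments also show that (1) data mixtures significantly impact performance;
(2) web corpora, rather than data perceived as high-quality such as Wikipedia, have the strongest positive correlation with downstream performance;
(3) domains interact in complex ways that often contradict common sense, so automatic approaches like RegMix are needed;
(4) data mixture effects transcend scaling laws.
Our code is available at https://github.com/sail-sg/regmix.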
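
The train-small, fit, predict loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which regression model is used or how candidate mixtures are searched, so the sketch assumes ordinary least-squares linear regression and random Dirichlet candidates. All function names are hypothetical, and `toy_proxy_loss` is a synthetic stand-in for actually training small proxy models; its numbers are made up and are not results from the paper.

```python
import numpy as np


def sample_mixtures(n, num_domains, rng, alpha=1.0):
    """Draw n mixtures (each row sums to 1) from a symmetric Dirichlet."""
    return rng.dirichlet(np.full(num_domains, alpha), size=n)


def toy_proxy_loss(mixtures, rng):
    """Synthetic stand-in for 'train a small model, measure its loss'.

    Assumption for illustration only: loss is linear in the mixture
    weights plus a little noise. Real proxy runs would replace this.
    """
    true_w = np.array([2.0, 3.0, 3.5, 4.0])  # made-up per-domain effects
    noise = rng.normal(0.0, 0.01, size=len(mixtures))
    return mixtures @ true_w + noise


def fit_linear(mixtures, losses):
    """Least-squares fit of loss ~ mixtures @ w.

    No intercept column is added: rows sum to 1, so a constant offset
    is already absorbed into the weights.
    """
    w, *_ = np.linalg.lstsq(mixtures, losses, rcond=None)
    return w


def predict_best(w, candidates):
    """Return the candidate mixture with the lowest predicted loss."""
    preds = candidates @ w
    i = int(np.argmin(preds))
    return candidates[i], float(preds[i])


rng = np.random.default_rng(0)

# Step 1: "train" many small proxy models on diverse mixtures (simulated).
train_mix = sample_mixtures(512, 4, rng)
train_loss = toy_proxy_loss(train_mix, rng)

# Step 2: fit a regression from mixture weights to observed loss.
w = fit_linear(train_mix, train_loss)

# Step 3: score many unseen candidate mixtures and keep the best one.
candidates = sample_mixtures(100_000, 4, rng)
best_mix, best_pred = predict_best(w, candidates)

print("fitted weights:", np.round(w, 3))
print("best predicted mixture:", np.round(best_mix, 3))
```

In this toy setup the selected mixture concentrates on the domain with the lowest fitted weight, because a purely linear predictor always prefers an extreme mixture. A regressor that can capture interactions between domains would be needed to model the non-obvious domain interactions mentioned in point (3).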

Abstract (original)

The data mixture for large language model pre-training significantly impacts performance, yet how to determine an effective mixture remains unclear. We propose RegMix to automatically identify a high-performing data mixture by formulating it as a regression task. RegMix trains many small models on diverse data mixtures, uses regression to predict performance of unseen mixtures, and applies the best predicted mixture to train a large-scale model with orders of magnitude more compute. To empirically validate RegMix, we train 512 models with 1M parameters for 1B tokens to fit the regression model and predict the best data mixture. Using this mixture we train a 1B parameter model for 25B tokens (i.e. 1000x larger and 25x longer) which we find performs best among 64 candidate 1B parameter models with other mixtures. Furthermore, RegMix consistently outperforms human selection in experiments involving models up to 7B models trained on 100B tokens, while matching or exceeding DoReMi using just 10% of the computational resources. Our experiments also show that (1) Data mixtures significantly impact performance; (2) Web corpora rather than data perceived as high-quality like Wikipedia have the strongest positive correlation with downstream performance; (3) Domains interact in complex ways often contradicting common sense, thus automatic approaches like RegMix are needed; (4) Data mixture effects transcend scaling laws. Our code is available at https://github.com/sail-sg/regmix.

arXiv information

Authors	Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang, Min Lin
Published	2025-01-23 17:35:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
