BiMix: A Bivariate Data Mixing Law for Language Model Pretraining

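Summary

Large language models have demonstrated remarkable capabilities across a wide range of tasks, which is attributed mainly to their use of diversely sourced data.
However, the effect of pretraining data composition on model performance is still poorly understood.
This paper introduces $\textbf{BiMix}$, a new bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining.
$\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains.
Extensive experiments on two large-scale datasets show that $\textbf{BiMix}$ extrapolates loss with high accuracy (mean relative error < 0.2%) and generalizes to unseen mixtures (R${}^{2}$ > 0.97).
Optimizing the domain proportions yields better model performance than existing methods.
In addition, the authors establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy.
The work contributes both theoretical insight into data mixing dynamics and practical tools for improving LLM training efficiency, paving the way for more effective scaling strategies in language model development.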

Abstract (original)

Large language models have demonstrated remarkable capabilities across various tasks, primarily attributed to the utilization of diversely sourced data. However, the impact of pretraining data composition on model performance remains poorly understood. This paper introduces $\textbf{BiMix}$, a novel bivariate data mixing law that models the joint scaling behavior of domain proportions and data volume in LLM pretraining. $\textbf{BiMix}$ provides a systematic framework for understanding and optimizing data mixtures across diverse domains. Through extensive experiments on two large-scale datasets, we demonstrate $\textbf{BiMix}$’s high accuracy in loss extrapolation (mean relative error < 0.2%) and its generalization to unseen mixtures (R${}^{2}$ > 0.97). Optimization of domain proportions yields superior model performance compared to existing methods. Furthermore, we establish entropy-based measures as efficient proxies for data mixing, offering a computationally lightweight strategy. Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency, paving the way for more effective scaling strategies in language model development.
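
The abstract mentions entropy-based measures as lightweight proxies for data mixing but does not say which measure is used or how it maps to domain proportions. As a rough illustration only, the sketch below assumes unigram Shannon entropy per domain and assumes that proportions are set proportional to that entropy. The function names (`shannon_entropy`, `entropy_proportions`) and the toy corpora are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the empirical unigram distribution of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # An empty token list yields an empty sum, i.e. entropy 0.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_proportions(domain_tokens):
    """Assign each domain a mixing proportion proportional to its entropy.

    ASSUMPTION: this normalization is an illustrative choice, not the
    paper's stated procedure.
    """
    entropies = {name: shannon_entropy(toks) for name, toks in domain_tokens.items()}
    normalizer = sum(entropies.values())
    if normalizer == 0:
        # Degenerate case (all domains have zero entropy): fall back to uniform.
        return {name: 1.0 / len(entropies) for name in entropies}
    return {name: h / normalizer for name, h in entropies.items()}


# Toy, made-up corpora used purely to exercise the functions.
domains = {
    "web": "a b c d a b c d".split(),
    "code": "x x x y".split(),
}
proportions = entropy_proportions(domains)
for name, p in proportions.items():
    print(f"{name}: {p:.3f}")
```

Because such a proxy needs only token counts, it can be computed without training any model, which is the sense in which an entropy-based strategy is computationally lightweight.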

arXiv information

Authors	Ce Ge, Zhijian Ma, Daoyuan Chen, Yaliang Li, Bolin Ding
Published	2025-01-27 11:25:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
