DataDecide: How to Predict Best Pretraining Data with Small Experiments

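Summary

Because pretraining large language models on different datasets is expensive, using smaller-scale experiments to decide on data is crucial for reducing costs.
Which benchmarks, and which methods of making decisions from performance observed at small scale, most accurately predict the datasets that yield the best large models?
To support open exploration of this question, we release the models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale.
We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering, using up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds.
We find that the ranking of models at a single small size (e.g., 150M parameters) is a strong baseline for predicting the best models at our larger target scale (1B): about 80% of pairwise comparisons are correct.
None of the scaling law methods among 8 baselines exceeds the compute-decision frontier of single-scale predictions, but DataDecide can measure improvements in future scaling laws.
We also find that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval more than 80% predictable at the target 1B scale with just 0.01% of the compute.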
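
The "about 80% of comparisons correct" figure refers to how often a small-scale ranking picks the same winner as the target scale when two corpora are compared. A minimal sketch of that pairwise agreement measure follows. The function name, the corpus names, and all scores are hypothetical and chosen only for illustration; they are not taken from DataDecide.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs ranked the same way at both scales.

    Both arguments map a corpus name to a benchmark score (higher is better).
    """
    corpora = sorted(small_scores)
    pairs = list(combinations(corpora, 2))
    agree = 0
    for a, b in pairs:
        small_prefers_a = small_scores[a] > small_scores[b]
        target_prefers_a = target_scores[a] > target_scores[b]
        if small_prefers_a == target_prefers_a:
            agree += 1
    return agree / len(pairs)


# Made-up scores for four hypothetical corpora at a small and a target scale.
small = {"corpus_a": 0.31, "corpus_b": 0.28, "corpus_c": 0.35, "corpus_d": 0.30}
target = {"corpus_a": 0.42, "corpus_b": 0.44, "corpus_c": 0.50, "corpus_d": 0.40}

print(decision_accuracy(small, target))
```

With these made-up numbers, four of the six pairs are ordered the same way at both scales, so the small-scale ranking would have made the right call in two thirds of the pairwise decisions.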

Abstract (original)

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
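
To illustrate why a continuous likelihood metric can be a more informative proxy than discrete accuracy at small scale, the sketch below scores two hypothetical models on the same two multiple-choice items. It assumes one possible continuous metric, the probability assigned to the correct option after normalizing over the answer options; the abstract does not specify the exact metrics, so this choice, the function names, and the log-likelihood values are all assumptions.

```python
import math


def softmax(logliks):
    """Normalize per-option log-likelihoods into probabilities."""
    m = max(logliks)
    exps = [math.exp(x - m) for x in logliks]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(examples):
    """Discrete metric: share of items where the top-scoring option is correct."""
    hits = 0
    for option_logliks, gold in examples:
        predicted = max(range(len(option_logliks)), key=option_logliks.__getitem__)
        hits += predicted == gold
    return hits / len(examples)


def mean_correct_prob(examples):
    """Continuous metric (assumed form): mean probability of the correct option."""
    total = 0.0
    for option_logliks, gold in examples:
        total += softmax(option_logliks)[gold]
    return total / len(examples)


# Each item is (log-likelihood per answer option, index of the correct option).
# Values are invented: both models miss the second item, but model_a only narrowly.
model_a = [([-1.0, -2.0], 0), ([-1.0, -1.1], 1)]
model_b = [([-1.0, -2.0], 0), ([-1.0, -3.0], 1)]

print(accuracy(model_a), accuracy(model_b))
print(mean_correct_prob(model_a), mean_correct_prob(model_b))
```

Both hypothetical models have the same accuracy, yet the continuous metric separates them because it registers how close each model came on the item it got wrong. That extra resolution is one plausible reason such metrics help when small models score near chance.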

arXiv information

Authors	Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Published	2025-04-15 17:02:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
