Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Abstract

Language model pretraining involves training on extensive corpora, where data quality plays a pivotal role. In this work, we aim to directly estimate the contribution of data during pretraining and to select pretraining data efficiently. Specifically, we draw inspiration from recent findings showing that the compression efficiency (i.e., the normalized loss) of diverse models on certain text correlates strongly with their downstream performance when the text domain aligns with the downstream benchmarks (Huang et al., 2024). Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning. To leverage this insight, we introduce predictive data selection (PreSelect), a lightweight and efficient data selection method that requires training and deploying only a fastText-based scorer. Through comprehensive experiments with 1B and 3B parameter models, we demonstrate that models trained on 30B tokens selected with PreSelect surpass the performance of the vanilla baseline trained on 300B tokens, achieving a 10x reduction in compute requirements. Furthermore, PreSelect significantly outperforms other competitive data selection baselines, such as DCLM and FineWeb-Edu, at the scale of 3B models trained on 100B tokens. We open-source our trained data selection scorer along with the curated datasets at https://github.com/hkust-nlp/PreSelect.
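
The hypothesis in the abstract can be illustrated with a small sketch. It assumes that "predictive" is measured as a rank correlation between several models' losses on one document and those same models' downstream scores; the abstract does not state the exact measure, so the Spearman-style statistic, the function names, and all numbers below are hypothetical illustrations rather than the authors' implementation. Training the fastText-based scorer on documents labeled this way is not shown.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def predictive_strength(doc_losses: Sequence[float],
                        downstream_scores: Sequence[float]) -> float:
    """Rank correlation between negated per-model losses on one document
    and the same models' downstream scores (assumed measure, see lead-in).
    A value near 1 means lower loss on this document goes with better
    downstream performance."""
    negated = [-loss for loss in doc_losses]
    return pearson(ranks(negated), ranks(downstream_scores))


# Hypothetical downstream scores of four models, weakest to strongest.
downstream = [0.30, 0.42, 0.55, 0.61]

# Hypothetical normalized losses of those four models on two documents.
doc_a_losses = [0.95, 0.88, 0.80, 0.74]  # loss falls as models improve
doc_b_losses = [0.70, 0.90, 0.65, 0.85]  # loss unrelated to model quality

strength_a = predictive_strength(doc_a_losses, downstream)
strength_b = predictive_strength(doc_b_losses, downstream)
print(strength_a, strength_b)
```

Under this assumed measure, the first document would be a candidate to keep and the second a candidate to drop, since only the first document's losses order the models the same way the downstream benchmark does.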

arXiv information

Authors	Kashun Shum, Yuzhen Huang, Hongjian Zou, Qi Ding, Yixuan Liao, Xiaoxin Chen, Qian Liu, Junxian He
Published	2025-04-04 10:59:54+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, DeepL
