DataMan: Data Manager for Pre-training Large Language Models

Summary

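The emergence of large language model (LLM) capabilities, driven by data scaling laws, makes the selection of pre-training data increasingly important.
However, existing methods rely on limited heuristics and human intuition and lack comprehensive, clear guidelines.
To address this, we take inspiration from "reverse thinking": we prompt an LLM to identify for itself which criteria benefit its performance.
Because an LLM's pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing.
In this paper, we train a Data Manager (DataMan) to learn quality rating and domain recognition from pointwise ratings, and use it to annotate a 447B-token pre-training corpus with 14 quality ratings and a domain type.
Our experiments validate the approach: using DataMan to select 30B tokens for training a 1.3B-parameter language model yields significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline.
The best-performing model, selected on Overall Score l=5, surpasses a model trained on 50% more data with uniform sampling.
We then continue pre-training on high-rated, domain-specific data annotated by DataMan, which improves domain-specific ICL performance and thereby verifies DataMan's domain-mixing ability.
Our findings emphasize the importance of quality ranking, the complementary nature of the quality criteria, and their low correlation with perplexity, and we analyze the misalignment between PPL and ICL performance.
We also thoroughly analyze the pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.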
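
As a rough illustration of rating-based selection under a token budget, the sketch below ranks annotated documents by an overall quality score and keeps the best ones until the budget is filled. It is not the authors' implementation: the `Doc` fields, the 1-to-5 score range, and the `select_by_quality` helper are all assumptions made for this example, and the greedy fill is a simple stand-in for whatever sampling scheme the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    text: str
    domain: str          # hypothetical field: annotated domain type
    overall_score: int   # hypothetical field: assumed 1 (worst) to 5 (best)
    num_tokens: int


def select_by_quality(docs, token_budget, min_score=1):
    """Greedily keep the highest-rated documents that fit in the token budget."""
    # Python's sort is stable, so equally rated documents keep their input order.
    ranked = sorted(
        (d for d in docs if d.overall_score >= min_score),
        key=lambda d: d.overall_score,
        reverse=True,
    )
    selected, used = [], 0
    for doc in ranked:
        if used + doc.num_tokens > token_budget:
            continue  # skip a document that would overflow the budget
        selected.append(doc)
        used += doc.num_tokens
    return selected, used


if __name__ == "__main__":
    corpus = [
        Doc("a", "code", 3, 10),
        Doc("b", "math", 5, 10),
        Doc("c", "law", 5, 15),
        Doc("d", "code", 4, 10),
    ]
    chosen, used = select_by_quality(corpus, token_budget=30)
    print([d.text for d in chosen], used)
```

Setting `min_score=5` restricts selection to top-rated documents only, loosely mirroring the "Overall Score l=5" setting mentioned in the summary; a domain-aware variant would apply the same budget separately per `domain`.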

Summary (original)

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by “reverse thinking” — prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan’s domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

arXiv information

Authors	Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao
Published	2025-02-26 18:01:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
