MAGA: MAssive Genre-Audience Reformulation to Pretraining Corpus Expansion

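Summary

Despite the remarkable capabilities of large language models across a wide range of tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data.
While model architectures continue to evolve, natural language data struggles to scale up.
To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method.
This work makes three main contributions. (1) We propose the MAGA reformulation method, a lightweight and scalable approach to pretraining corpus expansion, and build MAGACorpus, a 770B-token corpus.
(2) We evaluate MAGACorpus under different data budget scaling strategies, demonstrate consistent improvements across model sizes (134M-13B), and establish the necessity for next-generation large-scale synthetic pretraining language models.
(3) Through comprehensive analysis, we investigate the impact of prompt engineering on synthetic training collapse and reveal the limitations of conventional collapse detection metrics that rely on validation loss.
Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.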

Summary (original)

Despite the remarkable capabilities of large language models across various tasks, their continued scaling faces a critical challenge: the scarcity of high-quality pretraining data. While model architectures continue to evolve, natural language data struggles to scale up. To tackle this bottleneck, we propose the MAssive Genre-Audience (MAGA) reformulation method, which systematically synthesizes diverse, contextually rich pretraining data from an existing corpus. This work makes three main contributions: (1) We propose the MAGA reformulation method, a lightweight and scalable approach for pretraining corpus expansion, and build a 770B-token MAGACorpus. (2) We evaluate MAGACorpus with different data budget scaling strategies, demonstrating consistent improvements across various model sizes (134M-13B) and establishing the necessity for next-generation large-scale synthetic pretraining language models. (3) Through comprehensive analysis, we investigate prompt engineering's impact on synthetic training collapse and reveal limitations in conventional collapse detection metrics using validation losses. Our work shows that MAGA can substantially expand training datasets while maintaining quality, offering a reliable pathway for scaling models beyond data limitations.
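
The abstract does not describe the actual prompts or pipeline, so the following is only an illustrative sketch of the general idea of reformulating one source document for several (genre, audience) pairs. The genre list, audience list, prompt wording, and the function name build_reformulation_prompts are all hypothetical and are not taken from the paper; no language model is called here.

```python
from itertools import product

# Hypothetical genre and audience lists; the abstract does not enumerate them.
GENRES = ["textbook explanation", "dialogue", "FAQ"]
AUDIENCES = ["high-school students", "domain experts"]


def build_reformulation_prompts(document, genres=GENRES, audiences=AUDIENCES):
    """Return one rewrite prompt per (genre, audience) pair for a single document.

    This is an assumed prompt layout for illustration, not the authors' method.
    """
    prompts = []
    for genre, audience in product(genres, audiences):
        prompts.append({
            "genre": genre,
            "audience": audience,
            "prompt": (
                f"Rewrite the following text as a {genre} for {audience}, "
                "preserving its factual content.\n\n"
                f"{document}"
            ),
        })
    return prompts


if __name__ == "__main__":
    source = "Water boils at 100 degrees Celsius at sea level."
    for item in build_reformulation_prompts(source):
        print(item["genre"], "|", item["audience"])
```

Under these assumptions, each source document yields len(GENRES) * len(AUDIENCES) candidate rewrites, which is one plausible way a fixed corpus could be expanded into a larger synthetic one.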

arXiv information

Authors	Xintong Hao, Ke Shen, Chenggang Li
Published	2025-02-06 17:19:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
