Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

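Summary

The composition of pre-training datasets for large language models (LLMs) is largely undisclosed, which hinders transparency and efforts to optimize data quality, a key driver of model performance.
Current data selection methods, such as natural language quality assessment, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies.
To address these gaps, we propose PRRC, which evaluates data quality along four dimensions: Professionalism, Readability, Reasoning, and Cleanliness.
We also introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings.
Meta-rater uses proxy models to train a regression model that predicts validation loss, which makes it possible to identify the optimal combination of quality scores.
Experiments show that Meta-rater doubles convergence speed for 1.3B-parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens.
To advance research in data-centric LLM development, we also release an annotated SlimPajama-627B dataset labeled with 25 quality metrics (including PRRC).
Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches and offers a scalable paradigm for improving pre-training efficiency and model capability.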
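The weighting step described in the summary can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not say which regression model is used, so ordinary least squares stands in for it, and `toy_proxy_loss` is a hypothetical replacement for actually training a proxy model on the selected data and measuring its validation loss. The four score columns, the sample counts, and the choice to take the single candidate with the lowest predicted loss are all assumptions made for illustration.

```python
import random

def sample_weights(k, rng):
    """Draw one weight vector over k quality scores (non-negative, sums to 1)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_loss_regressor(weights, losses):
    """Least-squares fit of loss ~ dot(w, coef) via the normal equations.
    No intercept term is used because every weight vector sums to 1.
    (Assumption: a linear model stands in for the paper's regressor.)"""
    k = len(weights[0])
    xtx = [[sum(w[i] * w[j] for w in weights) for j in range(k)] for i in range(k)]
    xty = [sum(w[i] * y for w, y in zip(weights, losses)) for i in range(k)]
    return solve(xtx, xty)

def select_top(doc_scores, w, n_keep):
    """Indices of the n_keep documents with the highest weighted quality score."""
    order = sorted(range(len(doc_scores)),
                   key=lambda i: dot(doc_scores[i], w), reverse=True)
    return order[:n_keep]

def toy_proxy_loss(w, rng):
    """HYPOTHETICAL stand-in for: select data with w, train a proxy model,
    return its validation loss. Here score 0 is made the most helpful."""
    effect = [2.6, 2.9, 2.7, 2.8]
    return dot(w, effect) + rng.gauss(0.0, 0.01)

rng = random.Random(0)
K = 4  # e.g. one column per PRRC dimension (illustrative only)

# 1) Try random weightings and record the proxy validation loss of each.
train_w = [sample_weights(K, rng) for _ in range(64)]
train_loss = [toy_proxy_loss(w, rng) for w in train_w]

# 2) Fit a regressor that predicts validation loss from the weighting.
coef = fit_loss_regressor(train_w, train_loss)

# 3) Search many candidate weightings and keep the lowest predicted loss.
candidates = [sample_weights(K, rng) for _ in range(10000)]
best_w = min(candidates, key=lambda w: dot(w, coef))

# 4) Rank documents by the combined score and keep the top 300 of 1000.
doc_scores = [[rng.random() for _ in range(K)] for _ in range(1000)]
keep = select_top(doc_scores, best_w, 300)
```

The point of the regressor is cost: each proxy run is expensive, so only a few dozen weightings are evaluated directly, while the fitted model scores thousands of candidates cheaply.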

Abstract (original)

The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose PRRC to evaluate data quality across Professionalism, Readability, Reasoning, and Cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with scalable benefits observed in 3.3B models trained on 100B tokens. Additionally, we release the annotated SlimPajama-627B dataset, labeled across 25 quality metrics (including PRRC), to advance research in data-centric LLM development. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability.

arXiv information

Authors	Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Published	2025-05-01 02:37:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
