Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models

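Summary

The composition of pre-training datasets for large language models (LLMs) is mostly undisclosed, which hinders transparency and efforts to optimize data quality, a key driver of model performance.
Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or by strategies that focus on redundancy.
To address these gaps, the authors propose four dimensions for evaluating data quality: professionalism, readability, reasoning, and cleanliness.
They also introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings.
Meta-rater uses proxy models to train a regression model that predicts validation loss, which makes it possible to identify the optimal combination of quality scores.
Experiments show that Meta-rater doubles the convergence speed of 1.3B-parameter models and improves downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters.
The work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches and offers a scalable paradigm for improving pre-training efficiency and model capability.
To advance future research, the scripts, data, and models are released at https://github.com/opendatalab/Meta-rater.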

Abstract (original)

The composition of pre-training datasets for large language models (LLMs) remains largely undisclosed, hindering transparency and efforts to optimize data quality, a critical driver of model performance. Current data selection methods, such as natural language quality assessments, diversity-based filters, and classifier-based approaches, are limited by single-dimensional evaluation or redundancy-focused strategies. To address these gaps, we propose four dimensions to evaluate data quality: professionalism, readability, reasoning, and cleanliness. We further introduce Meta-rater, a multi-dimensional data selection method that integrates these dimensions with existing quality metrics through learned optimal weightings. Meta-rater employs proxy models to train a regression model that predicts validation loss, enabling the identification of optimal combinations of quality scores. Experiments demonstrate that Meta-rater doubles convergence speed for 1.3B parameter models and improves downstream task performance by 3.23, with advantages that scale to models as large as 7.2B parameters. Our work establishes that holistic, multi-dimensional quality integration significantly outperforms conventional single-dimension approaches, offering a scalable paradigm for enhancing pre-training efficiency and model capability. To advance future research, we release scripts, data, and models at https://github.com/opendatalab/Meta-rater.
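The weighting idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not say which regression model is used, so ordinary least squares stands in for it here, and the proxy-run losses, the `true_effect` values, the document scores, and all function names are synthetic or hypothetical. Only the four quality dimensions come from the text, and the existing quality metrics that Meta-rater also integrates are left out to keep the example short.

```python
import numpy as np

# Assumption: one score per quality dimension named in the abstract.
DIMENSIONS = ["professionalism", "readability", "reasoning", "cleanliness"]


def fit_loss_regressor(weights, losses):
    """Least-squares fit of validation loss as a linear function of the weights.

    No intercept is needed because every weight vector sums to 1.
    """
    coef, *_ = np.linalg.lstsq(weights, losses, rcond=None)
    return coef


def search_best_weights(coef, n_candidates, rng):
    """Sample candidate weightings and return the one with lowest predicted loss."""
    candidates = rng.dirichlet(np.ones(len(coef)), size=n_candidates)
    predicted = candidates @ coef
    return candidates[np.argmin(predicted)]


def select_top_documents(scores, weights, keep_fraction):
    """Rank documents by weighted score and keep the top fraction."""
    combined = scores @ weights
    n_keep = max(1, int(round(keep_fraction * scores.shape[0])))
    return np.argsort(-combined)[:n_keep]


rng = np.random.default_rng(0)

# Synthetic stand-in for proxy-model runs: each run uses one random weighting
# and reports a validation loss. The effect sizes are made up for illustration.
proxy_weights = rng.dirichlet(np.ones(len(DIMENSIONS)), size=64)
true_effect = np.array([-0.4, -0.1, -0.2, 0.1])
proxy_losses = 3.0 + proxy_weights @ true_effect + rng.normal(0, 0.005, 64)

coef = fit_loss_regressor(proxy_weights, proxy_losses)
best_weights = search_best_weights(coef, 10_000, rng)

# Synthetic per-document quality scores in [0, 1), one column per dimension.
doc_scores = rng.random((1000, len(DIMENSIONS)))
selected = select_top_documents(doc_scores, best_weights, keep_fraction=0.3)

for name, w in zip(DIMENSIONS, best_weights):
    print(f"{name}: {w:.3f}")
print("documents kept:", len(selected))
```

In this toy setup the search favors the dimension whose synthetic effect lowers the loss most. With real proxy models the relationship between weights and loss is unlikely to be linear, so a more flexible regressor would probably be needed.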

arXiv information

Authors	Xinlin Zhuang, Jiahui Peng, Ren Ma, Yinfan Wang, Tianyi Bai, Xingjian Wei, Jiantao Qiu, Chi Zhang, Ying Qian, Conghui He
Published	2025-06-04 15:35:04+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
