FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

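Summary

Data quality is critical when training large language models (LLMs).
Conventional heuristic filters often miss low-quality text or mistakenly remove valuable content.
This paper introduces an LLM-based line-level filtering method for improving training data quality.
GPT-4o mini is used to label a 20,000-document sample from FineWeb at the line level, with the model creating its own descriptive labels for low-quality lines.
These labels are grouped into nine main categories, and a DeBERTa-v3 classifier is trained to scale the filtering to a 10B-token subset of FineWeb.
To test the impact of the filtering, GPT-2 models are trained on both the original and the filtered datasets.
The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach performance targets faster, even with up to 25% less data.
This indicates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs.
To support further research in this area, the quality-annotated dataset FinerWeb-10BT and the codebase are released.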

Summary (original)

Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
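
The core idea of line-level filtering can be illustrated with a minimal sketch. This is not the authors' implementation: `filter_lines`, `toy_score`, the marker list, and the 0.5 threshold are all hypothetical names and values chosen for illustration, and `toy_score` is a crude heuristic standing in for a trained classifier such as the DeBERTa-v3 model mentioned in the abstract.

```python
from typing import Callable


def filter_lines(document: str,
                 score_line: Callable[[str], float],
                 threshold: float = 0.5) -> str:
    """Keep only non-empty lines whose quality score is at least `threshold`.

    `score_line` is any function mapping a line to a score in [0, 1];
    in a real pipeline it would wrap a trained line-quality classifier.
    """
    kept = [
        line
        for line in document.splitlines()
        if line.strip() and score_line(line) >= threshold
    ]
    return "\n".join(kept)


def toy_score(line: str) -> float:
    """Hypothetical stand-in scorer (an assumption, not the paper's model)."""
    boilerplate_markers = ("click here", "subscribe", "cookie",
                           "all rights reserved")
    lowered = line.lower()
    if any(marker in lowered for marker in boilerplate_markers):
        return 0.0  # obvious boilerplate
    # Very short lines are often navigation fragments; score them low.
    return 1.0 if len(line.split()) >= 5 else 0.3


doc = (
    "Data quality matters for training language models.\n"
    "Click here to subscribe!\n"
    "Home | About\n"
    "Line-level filtering removes noisy lines but keeps the rest."
)
clean = filter_lines(doc, toy_score)
print(clean)
```

Because scoring happens per line rather than per document, a page with useful body text and noisy navigation lines can be kept in part instead of being dropped or retained whole.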

arXiv information

Authors	Erik Henriksson, Otto Tarkka, Filip Ginter
Published	2025-01-13 13:26:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
