Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

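Summary

Pretraining data curation is a cornerstone of large language model (LLM) development and has led to a growing body of research on quality filtering of large web corpora.
From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, often reduced to a binary: examples that pass the filters are deemed valuable, and the rest are discarded as useless or detrimental.
However, a more detailed understanding of how different kinds of text contribute to model performance is still largely lacking.
This article presents the first study that uses registers (also known as genres), a standard widely used in corpus linguistics to model linguistic variation, to curate pretraining datasets and investigate the effect of register on LLM performance.
The authors carry out comparative studies by training models on register-classified data and evaluating them on standard benchmarks, showing that the register of pretraining data substantially affects model performance.
They uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, whereas including the Opinion class, which covers texts such as reviews and opinion blogs, is highly beneficial.
A model trained on the entire unfiltered dataset outperforms models trained on datasets limited to a single register, but combining well-performing registers such as How-to-Instructions, Informational Description, and Opinion leads to major improvements.
Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data.
These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.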

Abstract (original)

Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labeling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters deemed as valuable examples, others discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilizing registers (also known as genres) – a widely used standard in corpus linguistics to model linguistic variation – to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We perform comparative studies by training models with register classified data and evaluating them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.
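The register-based curation described above can be illustrated with a minimal sketch. It assumes each document already carries a register label from some classifier; the `docs` records, the field names, and the helper functions `split_by_register` and `combine_registers` are hypothetical and are not taken from the paper. It only shows how single-register and combined-register subsets might be assembled before training.

```python
from collections import defaultdict

# Hypothetical labeled documents; register names follow the classes
# mentioned in the abstract, the texts are invented for illustration.
docs = [
    {"text": "How to replace a bike tire ...", "register": "How-to-Instructions"},
    {"text": "The city council voted on ...", "register": "News"},
    {"text": "This phone is the best I have owned ...", "register": "Opinion"},
    {"text": "The aardvark is a nocturnal mammal ...", "register": "Informational Description"},
]


def split_by_register(documents):
    """Group document texts by their register label."""
    by_register = defaultdict(list)
    for doc in documents:
        by_register[doc["register"]].append(doc["text"])
    return dict(by_register)


def combine_registers(by_register, registers):
    """Concatenate the texts of the selected registers into one subset."""
    combined = []
    for name in registers:
        combined.extend(by_register.get(name, []))
    return combined


by_register = split_by_register(docs)

# Single-register subset (one training set per register).
news_only = by_register["News"]

# Combined subset of the registers the abstract reports as well-performing.
combined = combine_registers(
    by_register,
    ["How-to-Instructions", "Informational Description", "Opinion"],
)
print(len(news_only), len(combined))
```

Each resulting subset would then be used to train a separate model and compared on the same benchmarks; that training and evaluation step is outside the scope of this sketch.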

arXiv information

Authors	Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo
Published	2025-04-02 09:30:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
