A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity

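Summary

Pretraining is the preliminary and fundamental step in developing capable language models (LMs).
Nevertheless, the design of pretraining data is critically under-documented and is often based on intuitions that lack empirical support.
To address this, we pretrain 28 decoder-only models with 1.5B parameters each, training on data curated (1) at different points in time, (2) with a range of toxicity and quality filters, and (3) with different domain compositions.
First, we quantify the effect of the age of the pretraining data.
A temporal shift between the evaluation data and the pretraining data leads to degraded performance, and finetuning does not overcome it.
Second, we examine the effect of quality and toxicity filters and show a trade-off between performance on standard benchmarks and the risk of toxic generations.
Our findings indicate that there is no one-size-fits-all solution for filtering training data.
We also find that the effects of different types of filtering cannot be predicted from the characteristics of the text domain.
Finally, we empirically validate that including heterogeneous data sources, such as books and the web, is broadly beneficial and deserves higher priority.
These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, and we hope they will help support more informed, data-centric decisions in LM development.

Summary (original)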

Pretraining is the preliminary and fundamental step in developing capable language models (LM). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.

arXiv information

Authors	Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito
Published	2023-11-13 14:50:06+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
