Investigating Data Contamination for Pre-training Language Models

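Summary

Language models pre-trained on web-scale corpora demonstrate impressive capabilities on a wide range of downstream tasks.
However, there is growing concern that such capabilities may arise from evaluation datasets being included in the pre-training corpus (a phenomenon known as \textit{data contamination}) in a way that artificially inflates performance.
How this potential contamination affects LM performance on downstream tasks remains poorly understood.
This paper investigates the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models \textit{from scratch}.
It highlights the effects of both text contamination (\textit{i.e.}\ the input text of the evaluation samples) and ground-truth contamination (\textit{i.e.}\ the prompts asked on the input and the desired outputs) from evaluation data.
It also investigates the effect of repeated contamination on various downstream tasks.
In addition, it examines the n-gram-based definitions of contamination that are prevalent in current LLM reports and pinpoints their limitations and inadequacy.
The findings offer new insights into how data contamination affects language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.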

Abstract (original)

Language models pre-trained on web-scale corpora demonstrate impressive capabilities on diverse downstream tasks. However, there is increasing concern whether such capabilities might arise from evaluation datasets being included in the pre-training corpus — a phenomenon known as \textit{data contamination} — in a manner that artificially increases performance. There has been little understanding of how this potential contamination might influence LMs’ performance on downstream tasks. In this paper, we explore the impact of data contamination at the pre-training stage by pre-training a series of GPT-2 models \textit{from scratch}. We highlight the effect of both text contamination (\textit{i.e.}\ input text of the evaluation samples) and ground-truth contamination (\textit{i.e.}\ the prompts asked on the input and the desired outputs) from evaluation data. We also investigate the effects of repeating contamination for various downstream tasks. Additionally, we examine the prevailing n-gram-based definitions of contamination within current LLM reports, pinpointing their limitations and inadequacy. Our findings offer new insights into data contamination’s effects on language model capabilities and underscore the need for independent, comprehensive contamination assessments in LLM studies.
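
To make the idea of an n-gram-based contamination definition concrete, the following is a minimal sketch of one possible overlap check. It is not the procedure used in the paper or in any specific LLM report: the function names (`ngrams`, `ngram_overlap`, `is_contaminated`), the whitespace tokenization, the default `n=8`, and the `threshold=0.5` are all assumptions chosen for illustration.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in `text`.

    Tokenization is a plain lowercase whitespace split (an assumption;
    real pipelines typically use a proper tokenizer and normalization).
    """
    tokens: List[str] = text.lower().split()
    # If the text has fewer than n tokens, the range is empty and so is the set.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(eval_text: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the evaluation sample's n-grams that also occur in `corpus_text`."""
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    corpus_grams = ngrams(corpus_text, n)
    return len(eval_grams & corpus_grams) / len(eval_grams)


def is_contaminated(eval_text: str, corpus_text: str,
                    n: int = 8, threshold: float = 0.5) -> bool:
    """Flag a sample when its n-gram overlap reaches a hypothetical threshold."""
    return ngram_overlap(eval_text, corpus_text, n) >= threshold


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    verbatim_doc = "as the saying goes the quick brown fox jumps over the lazy dog today"
    paraphrased_doc = "a fast brown fox leaps over a sleepy dog"
    print(ngram_overlap(sample, verbatim_doc, n=3))
    print(ngram_overlap(sample, paraphrased_doc, n=3))
```

In this sketch a verbatim copy of the sample is fully matched, while a close paraphrase shares no 3-grams with it and so goes undetected. That is one way an exact-match definition of this kind can miss semantically equivalent text.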

arXiv information

Authors	Minhao Jiang, Ken Ziyu Liu, Ming Zhong, Rylan Schaeffer, Siru Ouyang, Jiawei Han, Sanmi Koyejo
Published	2024-01-11 17:24:49+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
