Data Contamination Through the Lens of Time

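Summary

Recent claims about the impressive abilities of large language models (LLMs) are often backed by evaluations on publicly available benchmarks.
Because LLMs train on wide swaths of the internet, this practice raises concerns about data contamination, that is, evaluating on examples that are explicitly or implicitly included in the training data.
Data contamination remains notoriously difficult to measure and mitigate, even with partial remedies such as controlled experiments on training data, canary strings, or embedding similarities.
In this work, we carry out the first thorough longitudinal analysis of data contamination in LLMs, using the natural experiment provided by the training cutoffs of GPT models to examine benchmarks released over time.
Specifically, we consider two code and mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends relating LLM pass rate to GitHub popularity and release date, which provide strong evidence of contamination.
By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models.
We conclude by discussing best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on web-scale data.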

Summary (original)

Recent claims about the impressive abilities of large language models (LLMs) are often supported by evaluating publicly available benchmarks. Since LLMs train on wide swaths of the internet, this practice raises concerns of data contamination, i.e., evaluating on examples that are explicitly or implicitly included in the training data. Data contamination remains notoriously challenging to measure and mitigate, even with partial attempts like controlled experimentation of training data, canary strings, or embedding similarities. In this work, we conduct the first thorough longitudinal analysis of data contamination in LLMs by using the natural experiment of training cutoffs in GPT models to look at benchmarks released over time. Specifically, we consider two code/mathematical problem-solving datasets, Codeforces and Project Euler, and find statistically significant trends among LLM pass rate vs. GitHub popularity and release date that provide strong evidence of contamination. By open-sourcing our dataset, raw results, and evaluation framework, our work paves the way for rigorous analyses of data contamination in modern models. We conclude with a discussion of best practices and future steps for publicly releasing benchmarks in the age of LLMs that train on webscale data.
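The longitudinal idea in the abstract, comparing pass rates on problems released before and after a model's training cutoff, can be sketched roughly as follows. This is only a minimal illustration and not the authors' evaluation framework: the function name `pass_rate_by_cutoff`, the cutoff date, and every record in `toy_results` are hypothetical and invented for the example, and a real analysis would also need a significance test and the GitHub popularity covariate.

```python
from datetime import date


def mean_or_nan(flags):
    """Return the fraction of True values, or NaN for an empty list."""
    return sum(flags) / len(flags) if flags else float("nan")


def pass_rate_by_cutoff(results, cutoff):
    """Split (release_date, passed) records at `cutoff` and return
    (pass rate before cutoff, pass rate on or after cutoff)."""
    before = [passed for released, passed in results if released < cutoff]
    after = [passed for released, passed in results if released >= cutoff]
    return mean_or_nan(before), mean_or_nan(after)


# Toy records invented purely for illustration (NOT results from the paper).
toy_results = [
    (date(2021, 3, 1), True),
    (date(2021, 6, 1), True),
    (date(2021, 8, 1), False),
    (date(2021, 11, 1), False),
    (date(2022, 2, 1), True),
    (date(2022, 5, 1), False),
]

# Hypothetical training cutoff, assumed only for this sketch.
hypothetical_cutoff = date(2021, 9, 1)

before_rate, after_rate = pass_rate_by_cutoff(toy_results, hypothetical_cutoff)
print(f"pass rate before cutoff: {before_rate:.2f}, after: {after_rate:.2f}")
```

A noticeably higher pass rate on pre-cutoff problems than on post-cutoff problems of comparable difficulty is the kind of signal the abstract treats as evidence of contamination; a gap like the one in this toy data would say nothing on its own without controls and a statistical test.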

arXiv information

Authors	Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, Samuel Dooley
Published	2023-10-16 17:51:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
