Investigating Data Contamination in Modern Benchmarks for Large Language Models

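Summary

Recent observations have highlighted a gap between inflated benchmark scores and the actual performance of LLMs, raising concerns that evaluation benchmarks may be contaminated.
This issue is especially important for closed-source models and for certain open-source models whose training data lacks transparency.
In this paper, we study data contamination by proposing two methods suited to both open-source and proprietary LLMs.
First, we introduce a retrieval-based system to investigate potential overlap between evaluation benchmarks and pretraining corpora.
Second, we present a new investigation protocol called Testset Slot Guessing (TS-Guessing), which applies to both open and proprietary models.
The approach masks a wrong answer in a multiple-choice question and prompts the model to fill in the gap.
It also obscures an unlikely word in an evaluation example and asks the model to produce it.
We find that certain commercial LLMs are surprisingly good at guessing the missing option across various test sets.
Specifically, on the TruthfulQA benchmark, LLM performance improves notably when additional metadata from the benchmark is provided.
On the MMLU benchmark, ChatGPT and GPT-4 achieved exact match rates of 52% and 57%, respectively, when guessing the missing options in benchmark test data.
We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.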

Abstract (original)

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for closed-source models and certain open-source models where training data transparency is lacking. In this paper we study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We first introduce a retrieval-based system to explore potential overlaps between evaluation benchmarks and pretraining corpora. We further present a novel investigation protocol named Testset Slot Guessing (TS-Guessing), applicable to both open and proprietary models. This approach entails masking a wrong answer in a multiple-choice question and prompting the model to fill in the gap. Additionally, it involves obscuring an unlikely word in an evaluation example and asking the model to produce it. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets. Specifically, in the TruthfulQA benchmark, we find that LLMs exhibit notable performance improvement when provided with additional metadata in the benchmark. Further, in the MMLU benchmark, ChatGPT and GPT-4 demonstrated an exact match rate of 52% and 57%, respectively, in guessing the missing options in benchmark test data. We hope these results underscore the need for more robust evaluation methodologies and benchmarks in the field.
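
The option-masking step of TS-Guessing and the exact match rate reported for MMLU can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function names, the `[MASK]` token, the prompt wording, and the text normalization are hypothetical and are not taken from the paper, which may use a different prompt and matching rule.

```python
import random
import string

MASK = "[MASK]"  # hypothetical placeholder token, not specified in the abstract


def build_ts_guessing_prompt(question, options, answer_key, rng):
    """Mask one wrong option and return (prompt, masked_key, masked_text).

    `options` maps option letters to option text; `answer_key` is the letter
    of the correct answer, which is never masked.
    """
    wrong_keys = [key for key in sorted(options) if key != answer_key]
    masked_key = rng.choice(wrong_keys)
    lines = [f"Question: {question}", "Options:"]
    for key in sorted(options):
        text = MASK if key == masked_key else options[key]
        lines.append(f"{key}. {text}")
    # Assumed instruction wording; the paper's actual prompt may differ.
    lines.append(f"Fill in the {MASK} in option {masked_key}. Reply with the option text only.")
    return "\n".join(lines), masked_key, options[masked_key]


def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace (an assumed rule)."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match_rate(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)


if __name__ == "__main__":
    rng = random.Random(0)
    toy_options = {"A": "Paris", "B": "Lyon", "C": "Nice", "D": "Lille"}
    prompt, masked_key, masked_text = build_ts_guessing_prompt(
        "What is the capital of France?", toy_options, "A", rng
    )
    print(prompt)
    # A real run would send `prompt` to a model and compare its reply to `masked_text`.
```

In a sketch like this, a high exact match rate on masked wrong options is the signal of interest, since a wrong option is hard to reconstruct from the question alone without having seen the test item.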

arXiv information

Authors	Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan
Published	2023-11-16 11:03:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
