DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Abstract

Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
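
The exact FPR computation mentioned above can be illustrated with a short sketch. It rests on assumptions that the abstract does not spell out: each backdoor's target is drawn uniformly and independently from a fixed number of possible outputs, an uncontaminated model therefore matches each target with probability 1/num_targets, and a model is flagged when at least `threshold` backdoors match. The function name and all three parameters are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from math import comb


def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Binomial tail P(X >= threshold) for X ~ Binomial(num_backdoors, 1/num_targets).

    Assumption (not stated in the abstract): an uncontaminated model hits each
    randomly chosen backdoor target independently with probability 1/num_targets.
    """
    p = 1.0 / num_targets
    tail = sum(
        comb(num_backdoors, k) * p**k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )
    return float(tail)


if __name__ == "__main__":
    # Illustrative values only: 8 backdoors, 10 possible targets, flag at >= 7 matches.
    print(f"{false_positive_rate(8, 10, 7):.2e}")
```

With these illustrative values the tail probability is about 7.3e-7, the same order as the MMLU-Pro figure quoted in the abstract, although the number of targets and the flagging threshold used here are assumptions rather than values taken from the source.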

arXiv information

Authors	Yize Cheng, Wenxiao Wang, Mazda Moayeri, Soheil Feizi
Published	2025-06-04 02:31:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
