Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

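Abstract

Large language models memorize parts of their training data.
Memorizing short snippets and facts is necessary for answering questions about the world and for fluency in any language.
However, models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary.
In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, quantifying the overlap between model responses and pretraining data when models respond to natural, benign prompts.
Across a variety of innocuous prompt categories (such as writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet.
In the worst cases, we find generations in which 100% of the content can be found exactly online.
For the same tasks, human-written text overlaps far less with Internet data.
We further study whether prompting strategies can close this reproduction gap between models and humans.
Appropriate prompting reduces non-adversarial reproduction on average, but mitigating worst-case reproduction of training data requires stronger defenses, even for benign interactions.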

Abstract (original)

Large language models memorize parts of their training data. Memorizing short snippets and facts is required to answer questions about the world and to be fluent in any language. But models have also been shown to reproduce long verbatim sequences of memorized text when prompted by a motivated adversary. In this work, we investigate an intermediate regime of memorization that we call non-adversarial reproduction, where we quantify the overlap between model responses and pretraining data when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with snippets from the Internet. In worst cases, we find generations where 100% of the content can be found exactly online. For the same tasks, we find that human-written text has far less overlap with Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. While appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses — even for benign interactions.
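
The abstract does not spell out how overlap is computed, so the following is only a minimal sketch of one plausible measure: the fraction of characters in a response that fall inside some verbatim substring of at least `min_len` characters found in a reference corpus. The function name `overlap_rate`, the `min_len` threshold, and the in-memory `corpus` list are all hypothetical choices made for illustration, not the authors' method.

```python
def overlap_rate(text: str, corpus: list[str], min_len: int = 50) -> float:
    """Fraction of characters in `text` covered by a verbatim match of
    length >= min_len in any corpus document (illustrative assumption)."""
    n = len(text)
    if n < min_len:
        return 0.0
    covered = [False] * n
    # Any match of length >= min_len is made up of windows of exactly
    # min_len characters, so checking fixed-size windows is sufficient.
    for start in range(n - min_len + 1):
        window = text[start:start + min_len]
        if any(window in doc for doc in corpus):
            for i in range(start, start + min_len):
                covered[i] = True
    return sum(covered) / n


# Toy usage: 10 of the 20 characters appear verbatim in the corpus.
rate = overlap_rate("XXXXXabcdefghijYYYYY", ["abcdefghij"], min_len=5)
print(rate)  # 0.5
```

A linear scan like this only works for toy corpora; matching against Internet-scale data would need an index (for example, a suffix array) in place of the `in` checks.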

arXiv information

Authors	Michael Aerni, Javier Rando, Edoardo Debenedetti, Nicholas Carlini, Daphne Ippolito, Florian Tramèr
Published	2024-11-15 14:55:01+00:00

Source and services used

arxiv.jp, Google
