Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration

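Summary

Membership inference attacks (MIAs) aim to infer whether a target data record was used to train a model.
Prior work has quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause notable privacy leakage on practical large language models (LLMs).
Existing MIAs designed for LMs fall into two categories: reference-free attacks and reference-based attacks.
Both rest on the hypothesis that training records are consistently sampled with higher probability.
However, this hypothesis relies heavily on the target model overfitting, which is mitigated by multiple regularization methods and by the generalization of LLMs.
Reference-based attacks appear to achieve promising effectiveness on LLMs; they measure a more reliable membership signal by comparing the probability discrepancy between the target model and a reference model.
However, the performance of reference-based attacks depends heavily on a reference dataset that closely resembles the training dataset, which is usually inaccessible in practical scenarios.
Overall, existing MIAs cannot effectively reveal privacy leakage in practical fine-tuned LLMs that are overfitting-free and private.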
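
The difference between the two attack families can be made concrete with a small sketch. It assumes the adversary can obtain per-token log-probabilities from the target model and, for the reference-based variant, from a reference model. The function names and the toy numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def mean_log_prob(token_log_probs: Sequence[float]) -> float:
    """Average per-token log-probability of one record under one model."""
    if not token_log_probs:
        raise ValueError("a record needs at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def reference_free_score(target_lp: Sequence[float]) -> float:
    """Reference-free signal: how likely the target model finds the record."""
    return mean_log_prob(target_lp)


def reference_based_score(
    target_lp: Sequence[float], reference_lp: Sequence[float]
) -> float:
    """Reference-based signal: target likelihood calibrated by a reference model."""
    return mean_log_prob(target_lp) - mean_log_prob(reference_lp)


# Toy log-probabilities, made up for illustration only.
# A common phrase: both models find it likely, trained on or not.
common_target, common_reference = [-0.2, -0.3, -0.1], [-0.2, -0.3, -0.2]
# A rare record seen during fine-tuning: only the target model finds it likely.
rare_target, rare_reference = [-0.5, -0.4, -0.6], [-2.0, -1.8, -2.2]

free_common = reference_free_score(common_target)
free_rare = reference_free_score(rare_target)
based_common = reference_based_score(common_target, common_reference)
based_rare = reference_based_score(rare_target, rare_reference)
```

In this toy case the reference-free score ranks the common phrase above the rare record, while the calibrated score ranks the rare record highest. The calibration is only as good as the reference model, which is why the choice of reference dataset matters.
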
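We propose a membership inference attack based on Self-calibrated Probabilistic Variation (SPV-MIA).
Specifically, because memorization in LLMs is inevitable during training and occurs before overfitting, we introduce probabilistic variation, a more reliable membership signal that is based on memorization rather than overfitting.
Furthermore, we introduce a self-prompt approach that constructs the dataset for fine-tuning the reference model by prompting the target LLM itself.
In this manner, the adversary can collect a dataset with a similar distribution from public APIs.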
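
The self-prompt step might look roughly like the sketch below. It assumes the target model is reachable only through a text-generation call, represented here by a hypothetical `generate` callable; the stub model, the prompts, and the sampling count are placeholders. Fine-tuning the reference model on the collected texts is omitted.

```python
import random
from typing import Callable, List


def build_self_prompt_dataset(
    generate: Callable[[str], str],
    prompts: List[str],
    samples_per_prompt: int,
) -> List[str]:
    """Query the target model and keep prompt + completion as reference text."""
    dataset = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            dataset.append(prompt + generate(prompt))
    return dataset


def make_stub_target(seed: int = 0) -> Callable[[str], str]:
    """Stand-in for the target LLM's generation API (hypothetical)."""
    rng = random.Random(seed)
    continuations = [" is under review.", " was approved.", " needs changes."]
    return lambda prompt: rng.choice(continuations)


stub_generate = make_stub_target()
reference_texts = build_self_prompt_dataset(
    stub_generate, prompts=["The request", "The report"], samples_per_prompt=3
)
```

Because the texts are sampled from the target model itself, they should follow a distribution close to what the model learned, without the adversary needing access to the training data.
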

Summary (original)

Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.
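
The abstract names probabilistic variation as the membership signal but does not define it. One way to read the idea, stated here purely as an assumption, is that a memorized record sits near a local peak of the model's probability, so slightly perturbed versions of the record are noticeably less likely than the record itself. The sketch below measures that gap and calibrates it against a reference model. The `log_prob` callables, the neighbor texts, the sign convention, and all numbers are hypothetical.

```python
from typing import Callable, Dict, List

ScoreFn = Callable[[str], float]  # text -> log-probability under one model


def local_variation(log_prob: ScoreFn, record: str, neighbors: List[str]) -> float:
    """Log-probability of the record minus the mean over perturbed neighbors."""
    if not neighbors:
        raise ValueError("need at least one neighbor")
    neighbor_mean = sum(log_prob(n) for n in neighbors) / len(neighbors)
    return log_prob(record) - neighbor_mean


def calibrated_variation(
    target: ScoreFn, reference: ScoreFn, record: str, neighbors: List[str]
) -> float:
    """Variation under the target model, calibrated by the reference model."""
    return local_variation(target, record, neighbors) - local_variation(
        reference, record, neighbors
    )


def table_model(table: Dict[str, float]) -> ScoreFn:
    """Toy model backed by a lookup table of made-up log-probabilities."""
    return lambda text: table[text]


record = "the cat sat on the mat"
neighbors = ["the cat sat on a mat", "a cat sat on the mat"]
# The target model shows a sharp peak at the record; the reference model does not.
target = table_model({record: -5.0, neighbors[0]: -9.0, neighbors[1]: -8.0})
reference = table_model({record: -7.0, neighbors[0]: -7.5, neighbors[1]: -7.5})

signal = calibrated_variation(target, reference, record, neighbors)
```

Under this reading, a large positive `signal` suggests the target model has a sharper peak at the record than the reference model does, which would be treated as evidence of membership.
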

arXiv information

Authors	Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
Published	2023-11-10 13:55:05+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
