PrivacyScalpel: Enhancing LLM Privacy via Interpretable Feature Intervention with Sparse Autoencoders

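Summary

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, but they also pose significant privacy risks by memorizing and leaking personally identifiable information (PII).
Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to prevent leakage effectively.
To address this challenge, the authors introduce PrivacyScalpel, a privacy-preserving framework that uses LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance.
PrivacyScalpel consists of three key steps: (1) feature probing, which identifies the layers of the model that encode PII-rich representations; (2) sparse autoencoding, in which a k-sparse autoencoder (k-SAE) disentangles and isolates privacy-sensitive features; and (3) feature-level interventions, which use targeted ablation and vector steering to suppress PII leakage.
An empirical evaluation on Gemma2-2b and Llama2-7b, both fine-tuned on the Enron dataset, shows that PrivacyScalpel reduces email leakage from 5.15% to as low as 0.0% while retaining more than 99.4% of the original model's utility.
Notably, the method outperforms neuron-level interventions on the privacy-utility trade-off, which indicates that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons.
Beyond improving LLM privacy, the approach offers insight into the mechanisms underlying PII memorization and contributes to the broader fields of model interpretability and secure AI deployment.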
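
The feature-probing step can be illustrated with a small sketch. This is not the authors' implementation: it assumes that probing means fitting a simple logistic-regression classifier on the activations of each layer and picking the layer where PII versus non-PII inputs are most separable. The function names (`train_probe`, `probe_accuracy`, `best_layer`), the hyperparameters, and the toy data are hypothetical, and accuracy is measured on the training examples for brevity, whereas a real probe would be scored on held-out data.

```python
import math

def train_probe(X, y, steps=200, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-score))
            err = p - yi                      # gradient of the log loss
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def probe_accuracy(X, y, w, b):
    """Fraction of examples whose label the probe predicts correctly."""
    correct = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        correct += int((score > 0) == bool(yi))
    return correct / len(X)

def best_layer(acts_by_layer, y):
    """Train one probe per layer and return the most predictive layer."""
    scores = {}
    for name, X in acts_by_layer.items():
        w, b = train_probe(X, y)
        scores[name] = probe_accuracy(X, y, w, b)
    return max(scores, key=scores.get), scores

# Toy data (assumption): label 1 marks inputs that contain PII.
labels = [1, 0, 1, 0]
activations = {
    "layer_0": [[0.0], [0.0], [0.0], [0.0]],    # carries no PII signal
    "layer_1": [[1.0], [-1.0], [1.0], [-1.0]],  # linearly separable
}
layer, scores = best_layer(activations, labels)
```

In this toy setup the probe on `layer_1` separates the two classes while the probe on `layer_0` cannot do better than chance, so `layer_1` would be the layer passed on to the sparse autoencoder.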

Abstract (original)

Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing but also pose significant privacy risks by memorizing and leaking Personally Identifiable Information (PII). Existing mitigation strategies, such as differential privacy and neuron-level interventions, often degrade model utility or fail to effectively prevent leakage. To address this challenge, we introduce PrivacyScalpel, a novel privacy-preserving framework that leverages LLM interpretability techniques to identify and mitigate PII leakage while maintaining performance. PrivacyScalpel comprises three key steps: (1) Feature Probing, which identifies layers in the model that encode PII-rich representations, (2) Sparse Autoencoding, where a k-Sparse Autoencoder (k-SAE) disentangles and isolates privacy-sensitive features, and (3) Feature-Level Interventions, which employ targeted ablation and vector steering to suppress PII leakage. Our empirical evaluation on Gemma2-2b and Llama2-7b, fine-tuned on the Enron dataset, shows that PrivacyScalpel significantly reduces email leakage from 5.15% to as low as 0.0%, while maintaining over 99.4% of the original model’s utility. Notably, our method outperforms neuron-level interventions in privacy-utility trade-offs, demonstrating that acting on sparse, monosemantic features is more effective than manipulating polysemantic neurons. Beyond improving LLM privacy, our approach offers insights into the mechanisms underlying PII memorization, contributing to the broader field of model interpretability and secure AI deployment.
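
Steps (2) and (3) of the abstract can be made concrete with a minimal sketch. It assumes a standard top-k sparse autoencoder (only the k largest pre-activations are kept) and the two interventions the abstract names: ablation, which zeroes selected features before decoding, and vector steering, which shifts an activation along a direction. The abstract does not give the paper's architecture, training procedure, or the way PII features are selected, so every function name, weight, and index below is a hypothetical stand-in.

```python
def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def topk_encode(x, W_enc, b_enc, k):
    """k-sparse encoding: keep the k largest pre-activations, zero the rest."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
    keep = set(order[:k])
    return [max(p, 0.0) if i in keep else 0.0 for i, p in enumerate(pre)]

def decode(z, W_dec, b_dec):
    """Reconstruct the activation as a weighted sum of feature directions."""
    out = list(b_dec)
    for zi, direction in zip(z, W_dec):  # W_dec holds one row per feature
        if zi != 0.0:
            for j, dj in enumerate(direction):
                out[j] += zi * dj
    return out

def ablate(z, pii_features):
    """Feature ablation: zero the features flagged as privacy-sensitive."""
    return [0.0 if i in pii_features else zi for i, zi in enumerate(z)]

def steer(h, direction, alpha):
    """Vector steering: shift h along direction (alpha < 0 suppresses it)."""
    return [hj + alpha * dj for hj, dj in zip(h, direction)]

# Toy example (assumption): 3-dimensional activations, 4 SAE features,
# with feature 0 treated as the PII-related one.
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
h = [1.0, -2.0, 0.5]

z = topk_encode(h, W_enc, [0.0] * 4, k=2)        # sparse feature code
h_ablated = decode(ablate(z, {0}), W_dec, [0.0] * 3)
h_steered = steer(h, W_dec[0], alpha=-1.0)
```

Ablation acts in feature space and then decodes back to the model's activation space, while steering edits the activation directly along a feature's decoder direction. Both operate on individual sparse features instead of on whole neurons, which is the distinction the abstract draws against neuron-level interventions.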

arXiv information

Authors	Ahmed Frikha, Muhammad Reza Ar Razi, Krishna Kanth Nakka, Ricardo Mendes, Xue Jiang, Xuebing Zhou
Published	2025-03-14 09:31:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
