SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Summary

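Diffusion models are powerful, but they can inadvertently generate harmful or undesirable content, which raises significant ethical and safety concerns.
Recent machine unlearning approaches offer potential solutions, but they often lack transparency, which makes it difficult to understand the changes they introduce to the base model.
In this work, we introduce SAeUron, a novel method that uses features learned by sparse autoencoders (SAEs) to remove unwanted concepts from text-to-image diffusion models.
First, we demonstrate that SAEs trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model capture sparse, interpretable features that correspond to specific concepts.
Building on this, we propose a feature selection method that enables precise interventions on model activations, blocking targeted content while preserving overall performance.
Evaluation on the competitive UnlearnCanvas benchmark for object and style unlearning highlights SAeUron's state-of-the-art performance.
Moreover, we show that a single SAE can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content even under adversarial attack.
Code and checkpoints are available at https://github.com/cywinski/SAeUron.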

Summary (original)

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron’s state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
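
The pipeline the abstract describes (encode activations into sparse features, pick the features tied to a concept, then intervene on them) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the top-k sparsity rule, the mean-difference scoring in `select_features`, the `multiplier` parameter, and all function names are assumptions chosen for illustration, and the weights are random rather than trained.

```python
import numpy as np

def encode(x, W_enc, b_enc, k):
    """Top-k sparse encoding (assumed): keep the k largest ReLU activations per row."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    # Column indices of everything except the k largest entries in each row.
    drop = np.argsort(pre, axis=1)[:, :-k]
    z = pre.copy()
    np.put_along_axis(z, drop, 0.0, axis=1)
    return z

def decode(z, W_dec, b_dec):
    """Linear decoder mapping sparse features back to activation space."""
    return z @ W_dec + b_dec

def select_features(z_concept, z_other, n_select):
    """Hypothetical scoring: rank features by how much more they fire on concept data."""
    score = z_concept.mean(axis=0) - z_other.mean(axis=0)
    return np.argsort(score)[::-1][:n_select]

def intervene(x, feature_ids, W_enc, b_enc, W_dec, b_dec, k, multiplier=0.0):
    """Rescale the selected features and add only the resulting change back to x."""
    z = encode(x, W_enc, b_enc, k)
    z_edit = z.copy()
    z_edit[:, feature_ids] *= multiplier
    # Using the difference of decodings leaves the SAE reconstruction error untouched.
    return x + decode(z_edit, W_dec, b_dec) - decode(z, W_dec, b_dec)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m, k = 8, 32, 4  # activation width, dictionary size, active features
    W_enc, b_enc = rng.normal(size=(d, m)), np.zeros(m)
    W_dec, b_dec = rng.normal(size=(m, d)), np.zeros(d)
    concept_acts = rng.normal(size=(16, d)) + 1.0  # stand-in for "concept" activations
    other_acts = rng.normal(size=(16, d))          # stand-in for everything else
    ids = select_features(encode(concept_acts, W_enc, b_enc, k),
                          encode(other_acts, W_enc, b_enc, k), n_select=2)
    edited = intervene(concept_acts, ids, W_enc, b_enc, W_dec, b_dec, k)
    print("selected features:", ids, "edited shape:", edited.shape)
```

In a real setting the activations would come from a chosen layer of the diffusion model at several denoising timesteps, and the edited activations would be written back during generation. Setting `multiplier` to 0 switches the selected features off, and a value of 1 leaves the activations unchanged.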

arXiv information

Authors	Bartosz Cywiński, Kamil Deja
Published	2025-01-31 18:39:23+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
