The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning

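Summary

Despite the remarkable generative capabilities of diffusion models, recent studies have shown that they can memorize and produce harmful content when given specific text prompts.
Fine-tuning approaches have been developed to mitigate this problem by unlearning harmful concepts, but these methods can be easily circumvented by jailbreaking attacks.
This implies that the harmful concept has not been fully erased from the model.
However, existing jailbreaking attacks, while effective, offer no interpretable explanation of why unlearned models still retain the concept, which hinders the development of defense strategies.
This work addresses these limitations by proposing an attack method that learns an orthogonal set of interpretable attack token embeddings.
The attack token embeddings can be decomposed into human-interpretable textual elements, revealing that unlearned models still retain the target concept through implicit textual components.
Furthermore, these attack token embeddings are powerful and transfer across text prompts, initial noises, and unlearned models, underscoring that unlearned models are more vulnerable than expected.
Finally, building on the insights from the interpretable attack, the authors develop a defense method that protects unlearned models against both the proposed attack and existing jailbreaking attacks.
Extensive experimental results demonstrate the effectiveness of both the attack and the defense strategies.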
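
The summary mentions two properties of the attack token embeddings: they form an orthogonal set, and each can be related back to human-readable text. The abstract does not say how either is computed, so the following is only a toy sketch under stated assumptions: orthogonality is measured here by squared pairwise cosine similarity, and "interpretation" is approximated by ranking vocabulary tokens by cosine similarity. The function names, the 3-dimensional vectors, and the three-word vocabulary are all hypothetical and are not taken from the paper.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def orthogonality_penalty(embeddings):
    """Sum of squared pairwise cosine similarities.

    The value is 0 when all embeddings are mutually orthogonal.
    Assumption: this is one common way to express such a constraint,
    not necessarily the one used by the authors.
    """
    total = 0.0
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            total += cosine(embeddings[i], embeddings[j]) ** 2
    return total


def nearest_tokens(embedding, vocab, k=2):
    """Return the k vocabulary tokens most similar to the embedding.

    Assumption: a nearest-neighbour lookup stands in for the paper's
    decomposition into textual elements, which the abstract does not detail.
    """
    ranked = sorted(vocab, key=lambda tok: cosine(embedding, vocab[tok]), reverse=True)
    return ranked[:k]


# Hypothetical toy vocabulary and attack embeddings (3-d for readability).
toy_vocab = {
    "red": [1.0, 0.0, 0.0],
    "round": [0.0, 1.0, 0.0],
    "fruit": [0.0, 0.0, 1.0],
}
attack_embeddings = [
    [0.9, 0.1, 0.0],
    [-0.1, 0.9, 0.0],
]

penalty = orthogonality_penalty(attack_embeddings)
top_tokens = nearest_tokens(attack_embeddings[0], toy_vocab, k=2)
```

In a real text-to-image pipeline the vocabulary would be the text encoder's token embedding table and the attack embeddings would be learned by optimization; both are outside the scope of this sketch.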

Abstract (original)

Despite the remarkable generation capabilities of diffusion models, recent studies have shown that they can memorize and create harmful content when given specific text prompts. Although fine-tuning approaches have been developed to mitigate this issue by unlearning harmful concepts, these methods can be easily circumvented through jailbreaking attacks. This implies that the harmful concept has not been fully erased from the model. However, existing jailbreaking attack methods, while effective, lack interpretability regarding why unlearned models still retain the concept, thereby hindering the development of defense strategies. In this work, we address these limitations by proposing an attack method that learns an orthogonal set of interpretable attack token embeddings. The attack token embeddings can be decomposed into human-interpretable textual elements, revealing that unlearned models still retain the target concept through implicit textual components. Furthermore, these attack token embeddings are powerful and transferable across text prompts, initial noises, and unlearned models, emphasizing that unlearned models are more vulnerable than expected. Finally, building on the insights from our interpretable attack, we develop a defense method to protect unlearned models against both our proposed and existing jailbreaking attacks. Extensive experimental results demonstrate the effectiveness of our attack and defense strategies.

arXiv information

Authors	Siyi Chen, Yimeng Zhang, Sijia Liu, Qing Qu
Published	2025-06-02 01:10:29+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
