TraSCE: Trajectory Steering for Concept Erasure

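Summary

Recent advances in text-to-image diffusion models have brought them into the public spotlight, made them widely accessible, and led everyday users to adopt them.
However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images.
Approaches have been proposed to erase such abstract concepts from the models, but jailbreaking techniques have succeeded in bypassing these safety measures.
This paper proposes TraSCE, an approach that steers the diffusion trajectory away from generating harmful content.
The approach builds on negative prompting, but the paper shows that conventional negative prompting is not a complete solution and can easily be bypassed in some corner cases.
To address this, the authors first propose a modification of conventional negative prompting.
They then introduce a localized loss-based guidance that strengthens the modified negative prompting technique by steering the diffusion trajectory.
The proposed method achieves state-of-the-art results on a range of benchmarks for removing harmful content, including benchmarks proposed by red teams, and for erasing artistic styles and objects.
The approach requires no training, no weight modifications, and no training data (neither images nor prompts), which makes it easier for model owners to erase new concepts.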
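
As a rough illustration of the negative prompting that the summary refers to, the sketch below uses the commonly used classifier-free guidance form in which the negative-prompt noise prediction replaces the unconditional one. The abstract does not give the formula, so this form, the function name `negative_prompt_guidance`, and the toy vectors are all assumptions made for illustration. The corner case shown (the user prompt coinciding with the negative prompt) is one plausible example of a bypass and is not confirmed by the abstract as the case the authors analyze. The paper's modified negative prompting and loss-based guidance are not reproduced here.

```python
# Toy sketch of conventional negative prompting (assumed standard form,
# not taken from the abstract):
#   eps_guided = eps_negative + scale * (eps_prompt - eps_negative)
# Plain Python lists stand in for the model's noise predictions.

def negative_prompt_guidance(eps_prompt, eps_negative, scale):
    """Push the prediction away from the negative prompt, element by element."""
    return [n + scale * (p - n) for p, n in zip(eps_prompt, eps_negative)]

# Hypothetical noise predictions for illustration only.
eps_negative = [0.5, -1.0, 0.25]   # prediction conditioned on the concept to erase
eps_benign = [0.1, 0.3, -0.2]      # prediction conditioned on an unrelated prompt

# Normal case: the difference term is non-zero, so guidance moves the result.
guided = negative_prompt_guidance(eps_benign, eps_negative, scale=7.5)

# Illustrative corner case: the user prompt matches the negative prompt.
# The difference term vanishes and the result is exactly the
# negative-prompt prediction, whatever the guidance scale.
degenerate = negative_prompt_guidance(eps_negative, eps_negative, scale=7.5)

print(guided)
print(degenerate == eps_negative)
```

In this toy setting the guidance term cancels whenever the two predictions coincide, so the sampler would follow the prediction for the very concept it was meant to avoid. That is one way to see why plain negative prompting may not suffice on its own.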

Abstract (original)

Recent advancements in text-to-image diffusion models have brought them to the public spotlight, becoming widely accessible and embraced by everyday users. However, these models have been shown to generate harmful content such as not-safe-for-work (NSFW) images. While approaches have been proposed to erase such abstract concepts from the models, jail-breaking techniques have succeeded in bypassing such safety measures. In this paper, we propose TraSCE, an approach to guide the diffusion trajectory away from generating harmful content. Our approach is based on negative prompting, but as we show in this paper, conventional negative prompting is not a complete solution and can easily be bypassed in some corner cases. To address this issue, we first propose a modification of conventional negative prompting. Furthermore, we introduce a localized loss-based guidance that enhances the modified negative prompting technique by steering the diffusion trajectory. We demonstrate that our proposed method achieves state-of-the-art results on various benchmarks in removing harmful content including ones proposed by red teams; and erasing artistic styles and objects. Our proposed approach does not require any training, weight modifications, or training data (both image or prompt), making it easier for model owners to erase new concepts.

arXiv information

Authors	Anubhav Jain, Yuya Kobayashi, Takashi Shibuya, Yuhta Takida, Nasir Memon, Julian Togelius, Yuki Mitsufuji
Published	2024-12-10 16:45:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
