SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

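Summary

Advanced text-to-image models such as DALL-E 2 and Midjourney can generate highly realistic images, which raises serious concerns about the spread of unsafe content.
Such content includes adult, violent, or deceptive imagery of political figures.
Although these models are claimed to implement rigorous safety mechanisms that restrict the generation of not-safe-for-work (NSFW) content, we devise and demonstrate the first prompt attacks on Midjourney, producing abundant photorealistic NSFW images.
We reveal the fundamental principles behind such prompt attacks and propose strategically substituting the high-risk sections of a suspect prompt to evade closed-source safety measures.
Our new framework, SurrogatePrompt, systematically generates attack prompts, using large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale.
Evaluation results show an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images that depict political figures in violent scenarios.
Both subjective and objective assessments confirm that the images generated from our attack prompts pose considerable safety hazards.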

Abstract (original)

Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney’s proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.

arXiv information

Authors	Zhongjie Ba, Jieming Zhong, Jiachen Lei, Peng Cheng, Qinglong Wang, Zhan Qin, Zhibo Wang, Kui Ren
Published	2023-09-25 13:20:15+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
