Divide-and-Conquer Attack: Harnessing the Power of LLM to Bypass Safety Filters of Text-to-Image Models

Summary

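Text-to-image (TTI) models offer many innovative services, but they also raise ethical concerns because they can generate unethical images.
Most public TTI services employ safety filters to prevent unintended images.
In this work, we introduce the Divide-and-Conquer Attack (DACA), which circumvents the safety filters of state-of-the-art TTI models such as DALL-E 3 and Midjourney.
Our attack uses LLMs as text transformation agents to create adversarial prompts.
We design attack helper prompts that effectively guide LLMs to break down an unethical drawing intent into multiple benign descriptions of individual image elements, so that the prompts bypass safety filters while still producing unethical images.
This works because the latent harmful meaning only becomes apparent when all the individual elements are drawn together.
Our evaluation shows that the attack successfully circumvents multiple strong closed-box safety filters.
The comprehensive success rate of DACA in bypassing the safety filters of the state-of-the-art TTI engine DALL-E 3 is above 85%, and the success rate for bypassing Midjourney V6 exceeds 75%.
Our findings have more severe security implications than methods based on manual crafting or iterative TTI model querying, owing to a lower attack barrier, enhanced interpretability, and better adaptation to defenses.
Our prototype is available at: https://github.com/researchcode001/Divide-and-Conquer-Attack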

Abstract (original)

Text-to-image (TTI) models offer many innovative services but also raise ethical concerns due to their potential to generate unethical images. Most public TTI services employ safety filters to prevent unintended images. In this work, we introduce the Divide-and-Conquer Attack to circumvent the safety filters of state-of-the-art TTI models, including DALL-E 3 and Midjourney. Our attack leverages LLMs as text transformation agents to create adversarial prompts. We design attack helper prompts that effectively guide LLMs to break down an unethical drawing intent into multiple benign descriptions of individual image elements, allowing them to bypass safety filters while still generating unethical images. This works because the latent harmful meaning only becomes apparent when all individual elements are drawn together. Our evaluation demonstrates that our attack successfully circumvents multiple strong closed-box safety filters. The comprehensive success rate of DACA bypassing the safety filters of the state-of-the-art TTI engine DALL-E 3 is above 85%, while the success rate for bypassing Midjourney V6 exceeds 75%. Our findings have more severe security implications than methods of manual crafting or iterative TTI model querying due to a lower attack barrier, enhanced interpretability, and better adaptation to defense. Our prototype is available at: https://github.com/researchcode001/Divide-and-Conquer-Attack

arXiv information

Authors	Yimo Deng, Huangxun Chen
Published	2024-03-14 14:01:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
