AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models

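Summary

The safety alignment of large language models (LLMs) can be compromised by manual jailbreak attacks and by (automatic) adversarial attacks.
Recent work suggests that LLMs can be patched against both kinds of attack. Manual jailbreak attacks are human-readable, but they are often few in number and publicly known, which makes them easy to block.
Adversarial attacks generate gibberish prompts, which can be detected with perplexity-based filters.
In this paper, we show that these solutions may be too optimistic.
We propose AutoDAN, an interpretable adversarial attack that combines the strengths of both types of attack.
It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate comparable to that of manual jailbreak attacks.
These prompts are interpretable and diverse, and they exhibit strategies commonly used in manual jailbreak attacks. When only limited training data or a single proxy model is available, they transfer better than their unreadable counterparts.
We also customize AutoDAN's objective to leak system prompts, a jailbreak application beyond eliciting harmful content that the adversarial attack literature has not addressed.
Our work provides a new way to red-team LLMs and to understand the mechanism of jailbreak attacks.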
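
The perplexity-based filter mentioned above can be illustrated with a minimal sketch. It is not the filter evaluated in the paper: the function names, the per-token log-probabilities, and the threshold of 100 are all hypothetical, and a real filter would take its log-probabilities from a language model rather than from hand-written lists. The sketch only shows the general idea that text a model finds unlikely (such as gibberish) has high perplexity and is rejected.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs, threshold):
    """Accept a prompt only if its perplexity does not exceed the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical per-token log-probabilities (assumptions, not measured values).
readable = [math.log(0.25)] * 8    # every token has probability 0.25
gibberish = [math.log(0.001)] * 8  # every token has probability 0.001

THRESHOLD = 100.0  # hypothetical cut-off, chosen only for this illustration

print(passes_filter(readable, THRESHOLD))
print(passes_filter(gibberish, THRESHOLD))
```

Under these assumed numbers the readable prompt is accepted and the gibberish prompt is rejected. A prompt that reads like natural language would fall on the accepting side of such a filter, which is why the summary describes this defense as possibly too optimistic.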

Summary (original)

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent work suggests that patching LLMs against these attacks is possible: manual jailbreak attacks are human-readable but often limited and public, making them easy to block; adversarial attacks generate gibberish prompts that can be detected using perplexity-based filters. In this paper, we show that these solutions may be too optimistic. We propose an interpretable adversarial attack, AutoDAN, that combines the strengths of both types of attacks. It automatically generates attack prompts that bypass perplexity-based filters while maintaining a high attack success rate like manual jailbreak attacks. These prompts are interpretable and diverse, exhibiting strategies commonly used in manual jailbreak attacks, and transfer better than their non-readable counterparts when using limited training data or a single proxy model. We also customize AutoDAN’s objective to leak system prompts, another jailbreak application not addressed in the adversarial attack literature. Our work provides a new way to red-team LLMs and to understand the mechanism of jailbreak attacks.

arXiv information

Authors	Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun
Published	2023-10-23 17:46:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
