A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily

Abstract

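Large language models (LLMs) such as ChatGPT and GPT-4 are designed to provide useful and safe responses.
However, adversarial prompts known as "jailbreaks" can circumvent these safeguards and lead LLMs to generate potentially harmful content.
Studying jailbreak prompts helps reveal the weaknesses of LLMs more clearly and guides further efforts to secure them.
Unfortunately, existing jailbreak methods require either intricate manual design or optimization on other white-box models, which compromises either generalization or efficiency.
In this paper, we generalize jailbreak prompt attacks into two aspects: (1) prompt rewriting and (2) scenario nesting.
Building on this, we propose ReNeLLM, an automatic framework that uses LLMs themselves to generate effective jailbreak prompts.
Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared with existing baselines.
Our study also reveals that current defense methods are inadequate for safeguarding LLMs.
Finally, we analyze the failure of LLM defenses from the perspective of prompt execution priority and propose corresponding defense strategies.
We hope that our research can act as a catalyst for both the academic community and LLM developers toward providing safer and better-regulated LLMs.
The code is available at https://github.com/NJUNLP/ReNeLLM.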

Abstract (original)

Large Language Models (LLMs), such as ChatGPT and GPT-4, are designed to provide useful and safe responses. However, adversarial prompts known as ‘jailbreaks’ can circumvent safeguards, leading LLMs to generate potentially harmful content. Exploring jailbreak prompts can help to better reveal the weaknesses of LLMs and further steer us to secure them. Unfortunately, existing jailbreak methods either suffer from intricate manual design or require optimization on other white-box models, which compromises either generalization or efficiency. In this paper, we generalize jailbreak prompt attacks into two aspects: (1) Prompt Rewriting and (2) Scenario Nesting. Based on this, we propose ReNeLLM, an automatic framework that leverages LLMs themselves to generate effective jailbreak prompts. Extensive experiments demonstrate that ReNeLLM significantly improves the attack success rate while greatly reducing the time cost compared to existing baselines. Our study also reveals the inadequacy of current defense methods in safeguarding LLMs. Finally, we analyze the failure of LLM defenses from the perspective of prompt execution priority, and propose corresponding defense strategies. We hope that our research can catalyze both the academic community and LLM developers towards the provision of safer and more regulated LLMs. The code is available at https://github.com/NJUNLP/ReNeLLM.

arXiv information

Authors	Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, Shujian Huang
Published	2024-03-27 13:29:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
