Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks

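Summary

In recent years, the rapid development of large language models (LLMs) has led to remarkable performance across a wide range of tasks.
However, research shows that LLMs are vulnerable to jailbreak attacks, in which an adversary uses carefully crafted prompts to induce the model to generate harmful content.
This vulnerability poses a significant challenge to the safe use and wider adoption of LLMs.
Existing defense methods offer protection from various perspectives, but they are often insufficiently effective or significantly degrade the model's capabilities.
This paper proposes Prefix Guidance (PG), a plug-and-play, easy-to-deploy jailbreak defense framework that guides the model to identify harmful prompts by directly setting the first few tokens of the model's output.
The approach combines the model's inherent security capabilities with an external classifier to defend against jailbreak attacks.
The authors demonstrate the effectiveness of PG across three models and five attack methods.
Compared with the baselines, their approach is generally more effective on average.
In addition, results on the Just-Eval benchmark further support PG's advantage in preserving model performance.
The code is available at https://github.com/weiyezhimeng/Prefix-Guidance.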
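The summary describes PG only at a high level: the first few output tokens are fixed, and an external classifier is combined with the model's own safety behavior. The following is a minimal Python sketch of that idea, not the authors' implementation. The prefix text, the `generate` and `looks_like_refusal` callables, the toy stand-ins, and the step of answering again without the prefix for benign prompts are all assumptions made for illustration.

```python
from typing import Callable

# Hypothetical prefix: the abstract does not give the actual guidance text.
GUIDE_PREFIX = "I cannot"


def prefix_guided_reply(
    prompt: str,
    generate: Callable[[str, str], str],
    looks_like_refusal: Callable[[str], bool],
    prefix: str = GUIDE_PREFIX,
) -> str:
    """Return the guided refusal if the classifier flags it, else a normal answer.

    generate(prompt, forced_start) must return the text the model produces
    after forced_start; looks_like_refusal stands in for the external classifier.
    """
    # Step 1: fix the first tokens of the output and let the model continue.
    guided = prefix + generate(prompt, prefix)
    # Step 2: the external classifier inspects the guided output.
    if looks_like_refusal(guided):
        return guided
    # Step 3 (assumed): the prompt looks benign, so answer without the prefix.
    return generate(prompt, "")


# Toy stand-ins so the sketch runs without a real model (hypothetical).
def toy_generate(prompt: str, forced_start: str) -> str:
    if not forced_start:
        return "Here is an answer to: " + prompt
    if "bomb" in prompt:
        return " help with that because it could cause harm."
    return " see anything unsafe here, so the request seems fine."


def toy_classifier(text: str) -> bool:
    return "harm" in text
```

With a real model, `generate` would wrap a decoding call that appends the prefix to the start of the assistant turn before sampling, and `looks_like_refusal` would wrap a trained text classifier. The toy functions above only imitate that behavior with keyword checks.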

Abstract (original)

In recent years, the rapid development of large language models (LLMs) has achieved remarkable performance across various tasks. However, research indicates that LLMs are vulnerable to jailbreak attacks, where adversaries can induce the generation of harmful content through meticulously crafted prompts. This vulnerability poses significant challenges to the secure use and promotion of LLMs. Existing defense methods offer protection from different perspectives but often suffer from insufficient effectiveness or a significant impact on the model’s capabilities. In this paper, we propose a plug-and-play and easy-to-deploy jailbreak defense framework, namely Prefix Guidance (PG), which guides the model to identify harmful prompts by directly setting the first few tokens of the model’s output. This approach combines the model’s inherent security capabilities with an external classifier to defend against jailbreak attacks. We demonstrate the effectiveness of PG across three models and five attack methods. Compared to baselines, our approach is generally more effective on average. Additionally, results on the Just-Eval benchmark further confirm PG’s superiority in preserving the model’s performance. Our code is available at https://github.com/weiyezhimeng/Prefix-Guidance.

arXiv information

Authors	Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang
Published	2024-08-22 17:21:34+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
