Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

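Summary

Despite extensive efforts to align large language models with human values and ethical guidelines, these models remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning capabilities. Conventional safety mechanisms often focus on detecting explicit malicious intent and leave deeper vulnerabilities unaddressed. This work introduces POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a jailbreak technique that uses contrastive reasoning to elicit unethical responses. POATE generates prompts with semantically opposite intent and combines them with adversarial templates, subtly steering the model toward harmful responses. The authors evaluate the attack extensively on six diverse language model families of varying parameter sizes, including LLaMA3, Gemma2, Phi3, and GPT-4, and report significantly higher attack success rates (~44%) than existing methods, demonstrating the robustness of the attack. They also evaluate the attack against seven safety defenses, revealing the limits of those defenses in addressing reasoning-based vulnerabilities. As a countermeasure, they propose a defense strategy that improves reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial exploits.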

Summary (original)

Despite significant efforts to align large language models with human values and ethical guidelines, these models remain susceptible to sophisticated jailbreak attacks that exploit their reasoning capabilities. Traditional safety mechanisms often focus on detecting explicit malicious intent, leaving deeper vulnerabilities unaddressed. In this work, we introduce a jailbreak technique, POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), which leverages contrastive reasoning to elicit unethical responses. POATE generates prompts with semantically opposite intents and combines them with adversarial templates to subtly direct models toward producing harmful responses. We conduct extensive evaluations across six diverse language model families of varying parameter sizes, including LLaMA3, Gemma2, Phi3, and GPT-4, to demonstrate the robustness of the attack, achieving significantly higher attack success rates (~44%) compared to existing methods. We evaluate our proposed attack against seven safety defenses, revealing their limitations in addressing reasoning-based vulnerabilities. To counteract this, we propose a defense strategy that improves reasoning robustness through chain-of-thought prompting and reverse thinking, mitigating reasoning-driven adversarial exploits.

arXiv information

Authors	Rachneet Sachdeva, Rima Hazra, Iryna Gurevych
Published	2025-01-03 15:40:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
