Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

要約

大規模な言語モデル（LLM）は、侵入攻撃に対して脆弱であることが知られており、敵は有害または非倫理的な反応を誘発するために慎重に設計されたプロンプトを活用します。
このような脅威は、実際の展開におけるLLMの安全性と信頼性に関する重要な懸念を提起しました。
既存の防衛メカニズムはそのようなリスクを部分的に軽減しますが、敵対的な技術のその後の進歩により、新しい脱獄方法がこれらの保護を回避し、静的防衛枠組みの制限を明らかにしました。
この作業では、コンテキスト検索のレンズを通じて進化する刑務所の脅威に対する防御を探ります。
第一に、特定の脱獄に対して最小限の安全整列例でさえ、この攻撃パターンに対する堅牢性を大幅に高めることができることを実証する予備研究を実施します。
この洞察に基づいて、私たちはさらに検索された生成（RAG）テクニックを活用し、安全性コンテキスト検索（SCR）を提案します。
当社の包括的な実験は、SCRが確立された脱獄戦術と新興の両方の戦術の両方に対して優れた防御パフォーマンスを達成し、LLMの安全に新しいパラダイムを提供する方法を示しています。
私たちのコードは公開時に利用可能になります。

要約(オリジナル)

Large Language Models (LLMs) are known to be vulnerable to jailbreaking attacks, wherein adversaries exploit carefully engineered prompts to induce harmful or unethical responses. Such threats have raised critical concerns about the safety and reliability of LLMs in real-world deployment. While existing defense mechanisms partially mitigate such risks, subsequent advancements in adversarial techniques have enabled novel jailbreaking methods to circumvent these protections, exposing the limitations of static defense frameworks. In this work, we explore defending against evolving jailbreaking threats through the lens of context retrieval. First, we conduct a preliminary study demonstrating that even a minimal set of safety-aligned examples against a particular jailbreak can significantly enhance robustness against this attack pattern. Building on this insight, we further leverage the retrieval-augmented generation (RAG) techniques and propose Safety Context Retrieval (SCR), a scalable and robust safeguarding paradigm for LLMs against jailbreaking. Our comprehensive experiments demonstrate how SCR achieves superior defensive performance against both established and emerging jailbreaking tactics, contributing a new paradigm to LLM safety. Our code will be available upon publication.

arxiv情報

著者	Taiye Chen,Zeming Wei,Ang Li,Yisen Wang
発行日	2025-05-21 16:58:14+00:00
arxivサイト	arxiv_id(pdf)

提供元, 利用サービス

arxiv.jp, Google

Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval

要約

要約(オリジナル)

arxiv情報

提供元, 利用サービス

最近の投稿

最近のコメント

アーカイブ

カテゴリー