Look Before You Leap: Enhancing Attention and Vigilance Regarding Harmful Content with GuidelineLLM

Abstract

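Despite being equipped with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise those mechanisms.
This vulnerability poses significant risks to real-world applications.
Existing approaches, such as Reinforcement Learning from Human Feedback (RLHF) and red-teaming, face challenges in both training efficiency and generalization capability.
Developing effective strategies that enable LLMs to resist continuously evolving jailbreak attempts remains a significant challenge.
To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may contain harmful content.
Before an LLM responds to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLM.
Importantly, our approach eliminates the need for additional safety fine-tuning of the LLMs themselves;
only GuidelineLLM requires fine-tuning.
This characteristic enhances the general applicability of GuidelineLLM across various LLMs.
Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against LLMs (an average reduction of 34.17% in ASR) while maintaining the usefulness of LLMs in handling benign queries.
The code is available at https://github.com/sqzhang-lazy/GuidelineLLM.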
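
The two-stage flow described in the abstract (derive guidelines for a query first, then pass them to the responding model) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the prompt wording, the `TextModel` type, and the function names `build_responder_prompt` and `respond_with_guidelines` are all hypothetical, and each model is stood in for by a plain function from prompt text to output text.

```python
from typing import Callable

# Hypothetical abstraction: any function that maps a prompt string to a completion.
TextModel = Callable[[str], str]

# Assumed prompt wording; the abstract does not specify the actual template.
GUIDELINE_PROMPT = (
    "Identify potential risks in the following query and summarize them "
    "as guidelines for a responding assistant.\n\nQuery: {query}"
)


def build_responder_prompt(query: str, guidelines: str) -> str:
    """Place the guidelines ahead of the query so the responder reads them first."""
    return f"Guidelines:\n{guidelines}\n\nQuery:\n{query}"


def respond_with_guidelines(
    query: str, guideline_model: TextModel, responder: TextModel
) -> str:
    """Run the guideline model first, then the (unmodified) responding model."""
    # Stage 1: the guideline model turns the query into risk-aware guidelines.
    guidelines = guideline_model(GUIDELINE_PROMPT.format(query=query))
    # Stage 2: the responder answers with those guidelines in its context.
    return responder(build_responder_prompt(query, guidelines))


if __name__ == "__main__":
    # Stub models for demonstration only; real models would be API or local calls.
    def stub_guideline_model(prompt: str) -> str:
        return "- Refuse if the query seeks instructions that could cause harm."

    def stub_responder(prompt: str) -> str:
        return "Responder saw this prompt:\n" + prompt

    print(respond_with_guidelines("How do I bake bread?", stub_guideline_model, stub_responder))
```

Because the responder is passed in as an ordinary callable, the sketch mirrors the abstract's point that only the guideline model would need safety fine-tuning, while the responding model can be swapped without changes.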

Abstract (original)

Despite being empowered with alignment mechanisms, large language models (LLMs) are increasingly vulnerable to emerging jailbreak attacks that can compromise their alignment mechanisms. This vulnerability poses significant risks to real-world applications. Existing work faces challenges in both training efficiency and generalization capabilities (i.e., Reinforcement Learning from Human Feedback and Red-Teaming). Developing effective strategies to enable LLMs to resist continuously evolving jailbreak attempts represents a significant challenge. To address this challenge, we propose a novel defensive paradigm called GuidelineLLM, which assists LLMs in recognizing queries that may have harmful content. Before LLMs respond to a query, GuidelineLLM first identifies potential risks associated with the query, summarizes these risks into guideline suggestions, and then feeds these guidelines to the responding LLMs. Importantly, our approach eliminates the necessity for additional safety fine-tuning of the LLMs themselves; only the GuidelineLLM requires fine-tuning. This characteristic enhances the general applicability of GuidelineLLM across various LLMs. Experimental results demonstrate that GuidelineLLM can significantly reduce the attack success rate (ASR) against LLM (an average reduction of 34.17\% ASR) while maintaining the usefulness of LLM in handling benign queries. The code is available at https://github.com/sqzhang-lazy/GuidelineLLM.

arXiv information

Authors	Shaoqing Zhang, Zhuosheng Zhang, Kehai Chen, Rongxiang Weng, Muyun Yang, Tiejun Zhao, Min Zhang
Published	2025-04-14 12:52:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
