AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

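Summary

This work proposes a new approach to evaluating the effectiveness of jailbreak attacks on large language models (LLMs) such as GPT-4 and LLaMa2, departing from traditional binary evaluations that focus on robustness.
It introduces two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation.
Each framework scores attacks on a range from 0 to 1 and offers its own perspective, allowing a more comprehensive and nuanced evaluation of attack effectiveness and giving attackers a better basis for refining their attack prompts.
The authors have also developed a comprehensive ground-truth dataset tailored specifically to jailbreak tasks.
This dataset serves as a key benchmark for the current study and also establishes a foundational resource for future research, enabling consistent comparative analysis in this evolving field.
A careful comparison with traditional evaluation methods showed that the proposed evaluation agrees with the baseline's trend while providing a deeper and more detailed assessment.
The authors believe that accurately evaluating the effectiveness of attack prompts in the jailbreak task lays a solid foundation for evaluating a wider range of similar or more complex tasks in the area of prompt injection, and that it could potentially revolutionize the field.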
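
To make the contrast between a binary evaluation and a score on a range from 0 to 1 concrete, here is a minimal sketch. It is not the authors' scoring method: the function names, the 0.5 threshold, and the example scores are all hypothetical, chosen only to show why a graded score can separate attack prompts that a pass/fail metric treats as identical.

```python
from statistics import mean


def binary_success_rate(scores, threshold=0.5):
    """Binary-style view: each response counts as a success (1) or a failure (0).

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    return mean(1.0 if s >= threshold else 0.0 for s in scores)


def graded_effectiveness(scores):
    """Graded view: average per-response scores, each required to lie in [0, 1]."""
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score {s} is outside the range [0, 1]")
    return mean(scores)


# Hypothetical per-response scores for two attack prompts.
# Prompt A elicits only partially harmful answers; prompt B elicits fully harmful ones.
prompt_a_scores = [0.5, 0.5, 0.5, 0.5]
prompt_b_scores = [1.0, 1.0, 1.0, 1.0]

for name, scores in [("A", prompt_a_scores), ("B", prompt_b_scores)]:
    print(
        f"prompt {name}: "
        f"binary={binary_success_rate(scores)} "
        f"graded={graded_effectiveness(scores)}"
    )
```

Under these assumed scores, both prompts get the same binary success rate, while the graded average is lower for prompt A than for prompt B. That difference is the kind of extra detail a score in the range 0 to 1 can expose.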

Summary (original)

In our research, we pioneer a novel approach to evaluate the effectiveness of jailbreak attacks on Large Language Models (LLMs), such as GPT-4 and LLaMa2, diverging from traditional robustness-focused binary evaluations. Our study introduces two distinct evaluation frameworks: a coarse-grained evaluation and a fine-grained evaluation. Each framework, using a scoring range from 0 to 1, offers a unique perspective, enabling a more comprehensive and nuanced evaluation of attack effectiveness and empowering attackers to refine their attack prompts with greater understanding. Furthermore, we have developed a comprehensive ground truth dataset specifically tailored for jailbreak tasks. This dataset not only serves as a crucial benchmark for our current study but also establishes a foundational resource for future research, enabling consistent and comparative analyses in this evolving field. Upon meticulous comparison with traditional evaluation methods, we discovered that our evaluation aligns with the baseline’s trend while offering a more profound and detailed assessment. We believe that by accurately evaluating the effectiveness of attack prompts in the Jailbreak task, our work lays a solid foundation for assessing a wider array of similar or even more complex tasks in the realm of prompt injection, potentially revolutionizing this field.

arXiv information

Authors	Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, Yongfeng Zhang
Published	2024-03-20 14:08:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
