Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

Summary

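This study exposes safety vulnerabilities of large language models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intent across several queries.
We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory. It models a network of semantically linked actors as attack clues and uses them to generate diverse and effective attack paths toward harmful targets.
ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intent by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths toward the same harmful target by leveraging the LLM's knowledge to specify correlated actors as various attack clues.
In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even GPT-o1.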
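We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data generated by ActorAttack.
We demonstrate that models safety-tuned on our safety dataset are more robust to multi-turn attacks.
Code is available at https://github.com/renqibing/ActorAttack.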

Summary (original)

This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs’ knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks. Code is available at https://github.com/renqibing/ActorAttack.

arXiv information

Authors	Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
Published	2024-10-14 16:41:49+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
