Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Abstract

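Although training large language models (LLMs) with alignment techniques has made generated content safer, these models remain susceptible to jailbreaking, an adversarial attack method that exposes security vulnerabilities in LLMs.
In particular, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs.
However, the optimization process in GCG is highly time-consuming, which makes the jailbreaking pipeline inefficient.
In this paper, we investigate the GCG process and identify the Indirect Effect as the key bottleneck of GCG optimization.
To this end, we propose Model Attack Gradient Index GCG (MAGIC), which addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, accelerating the procedure through less computation and fewer iterations.
Experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup while maintaining attack success rates (ASR) on par with or higher than other baselines.
MAGIC achieves an ASR of 74% on Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5.
Code is available at https://github.com/jiah-li/magic.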

Abstract (original)

Despite the advancements in training Large Language Models (LLMs) with alignment techniques to enhance the safety of generated content, these models remain susceptible to jailbreak, an adversarial attack method that exposes security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. However, the optimization process involved in GCG is highly time-consuming, rendering the jailbreaking pipeline inefficient. In this paper, we investigate the process of GCG and identify an issue of Indirect Effect, the key bottleneck of the GCG optimization. To this end, we propose the Model Attack Gradient Index GCG (MAGIC), that addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, thereby accelerating the procedure by having less computation and fewer iterations. Our experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par or even higher than other baselines. Our MAGIC achieved an ASR of 74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5. Code is available at https://github.com/jiah-li/magic.
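The two headline metrics in the abstract, attack success rate (ASR) and speedup, are simple ratios. The sketch below shows one common way to compute them, assuming ASR is the fraction of attack attempts judged successful and speedup is baseline time divided by method time. The function names and all numbers are hypothetical illustrations, not values or code from the paper.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attempts marked successful (True)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def speedup(baseline_seconds, method_seconds):
    """Return how many times faster the method is than the baseline."""
    return baseline_seconds / method_seconds


# Hypothetical data for illustration only; not taken from the paper.
outcomes = [True, True, True, False]
asr = attack_success_rate(outcomes)
ratio = speedup(300.0, 200.0)
print(f"ASR = {asr:.0%}, speedup = {ratio:.1f}x")
```

How an attempt is judged successful is left open here; that criterion is an evaluation choice the abstract does not specify.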

arXiv information

Authors	Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, Yu Hong
Published	2024-12-11 18:37:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
