PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization

Abstract

Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients, or is based on human knowledge (prompt engineering) to complete jailbreak, and they hardly consider the interaction of images and text, resulting in inability to jailbreak in black box scenarios or poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.

arXiv information

Authors	Ruoxi Cheng, Yizhong Ding, Shuirong Cao, Ranjie Duan, Xiaoshuang Jia, Shaowei Yuan, Zhiqiang Wang, Xiaojun Jia
Published	2025-02-03 11:44:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
