Adversarial Attacks on Multimodal Agents

Summary

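Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents that can take actions in real environments.
In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is harder than prior attacks because the attacker has limited access to, and knowledge of, the environment.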
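Our attacks use adversarial text strings to guide gradient-based perturbation of a single trigger image in the environment:
(1) the captioner attack targets white-box captioners, when a captioner is used to turn images into captions that are passed to the VLM as additional input;
(2) the CLIP attack targets a set of CLIP models jointly, and the resulting perturbation can transfer to proprietary VLMs.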
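The CLIP attack described above optimizes one image against several models at once. A minimal sketch of that idea, under heavy assumptions: signed projected gradient ascent (PGD-style) that nudges a flattened toy image so its embeddings under several encoders move toward a target text embedding, while the perturbation stays inside an L-infinity ball of radius 16/256. The encoders are random linear maps standing in for CLIP image encoders, and the targets are random vectors standing in for the embeddings of the adversarial text string. `make_encoders`, `cosine_and_grad`, `mean_cosine`, and `ensemble_linf_pgd` are hypothetical names, the step size and step count are arbitrary, and this is not the authors' implementation.

```python
import numpy as np


def make_encoders(num_encoders, image_dim, embed_dim, seed=0):
    """Hypothetical toy encoders: random linear maps standing in for CLIP image encoders."""
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((embed_dim, image_dim)) for _ in range(num_encoders)]


def cosine_and_grad(W, x, t):
    """Cosine similarity between the embedding W @ x and target t, plus its gradient w.r.t. x."""
    e = W @ x
    e_norm = np.linalg.norm(e)
    t_unit = t / np.linalg.norm(t)
    cos = float(e @ t_unit / e_norm)
    # d cos / d e = t_unit / |e| - cos * e / |e|^2, then chain rule through e = W @ x
    grad_e = t_unit / e_norm - cos * e / e_norm**2
    return cos, W.T @ grad_e


def mean_cosine(x, encoders, targets):
    """Ensemble objective: mean cosine similarity across all encoders."""
    return float(np.mean([cosine_and_grad(W, x, t)[0] for W, t in zip(encoders, targets)]))


def ensemble_linf_pgd(image, encoders, targets, eps=16 / 256, step=1 / 256, steps=100):
    """Signed-gradient ascent on the ensemble objective, projected so that
    max |x - image| <= eps and every pixel stays in [0, 1].
    The step size and step count are arbitrary illustrative choices."""
    x = image.copy()
    for _ in range(steps):
        # Average the gradients so one perturbation serves every encoder jointly.
        grad = sum(cosine_and_grad(W, x, t)[1] for W, t in zip(encoders, targets)) / len(encoders)
        x = x + step * np.sign(grad)
        x = np.clip(x, image - eps, image + eps)  # project onto the L-infinity ball
        x = np.clip(x, 0.0, 1.0)                  # keep a valid image
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image = rng.uniform(0.0, 1.0, size=64)                 # flattened toy "trigger image"
    encoders = make_encoders(3, image_dim=64, embed_dim=16)
    targets = [rng.standard_normal(16) for _ in encoders]  # stand-ins for adversarial text embeddings
    adv = ensemble_linf_pgd(image, encoders, targets)
    print("mean cosine before:", mean_cosine(image, encoders, targets))
    print("mean cosine after: ", mean_cosine(adv, encoders, targets))
    print("max |delta|:", np.abs(adv - image).max())
```

Averaging the per-encoder gradients is what lets a single perturbation target the whole ensemble jointly. Whether such a perturbation transfers to a proprietary VLM is the empirical claim of the abstract; this toy does not test it.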
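To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks built on VisualWebArena, an environment for web-based multimodal agent tasks.
With a perturbation bounded by an L-infinity norm of $16/256$ on a single image, the captioner attack makes a captioner-augmented GPT-4V agent carry out the adversarial goals with a 75% success rate.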
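The $16/256$ budget above can be made concrete with a short worked equation, assuming pixel values are normalized to $[0, 1]$ and writing $x$ for the original trigger image and $x'$ for the perturbed one (this notation is ours, not the paper's):

```latex
\delta = x' - x, \qquad
\|\delta\|_\infty = \max_i \lvert x'_i - x_i \rvert \le \epsilon = \frac{16}{256} = 0.0625
```

So no single pixel channel may change by more than 0.0625, roughly 16 of the 256 intensity levels of an 8-bit channel: the attack can only nudge each pixel slightly, never change one drastically.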
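When the captioner is removed, or when GPT-4V generates its own captions, the CLIP attack reaches success rates of 21% and 43%, respectively.
Experiments on agents built on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, reveal interesting differences in their robustness.
Further analysis identifies several key factors that contribute to the attack's success, and we discuss the implications for defenses.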
Project page: https://chenwu.io/attack-agent
Code and data: https://github.com/ChenWu98/agent-attack

Summary (original)

Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack’s success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack

arXiv information

Authors	Chen Henry Wu, Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried, Aditi Raghunathan
Published	2024-06-18 17:32:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
