Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast

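Summary

A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use.
Nonetheless, red-teaming efforts have shown that adversarial images or prompts can jailbreak an MLLM and cause unaligned behavior.
In this work, we report an even more severe safety issue in multi-agent environments, which we call infectious jailbreak.
The adversary only needs to jailbreak a single agent; without any further intervention from the adversary, (almost) all agents then become infected exponentially fast and exhibit harmful behaviors.
To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents and use randomized pair-wise chat as a proof-of-concept instantiation of multi-agent interaction.
Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak.
Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that satisfies this principle remains an open question.
Our project page is available at https://sail-sg.github.io/Agent-Smith/.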

Abstract (original)

A multimodal large language model (MLLM) agent can receive instructions, capture images, retrieve histories from memory, and decide which tools to use. Nonetheless, red-teaming efforts have revealed that adversarial images/prompts can jailbreak an MLLM and cause unaligned behaviors. In this work, we report an even more severe safety issue in multi-agent environments, referred to as infectious jailbreak. It entails the adversary simply jailbreaking a single agent, and without any further intervention from the adversary, (almost) all agents will become infected exponentially fast and exhibit harmful behaviors. To validate the feasibility of infectious jailbreak, we simulate multi-agent environments containing up to one million LLaVA-1.5 agents, and employ randomized pair-wise chat as a proof-of-concept instantiation for multi-agent interaction. Our results show that feeding an (infectious) adversarial image into the memory of any randomly chosen agent is sufficient to achieve infectious jailbreak. Finally, we derive a simple principle for determining whether a defense mechanism can provably restrain the spread of infectious jailbreak, but how to design a practical defense that meets this principle remains an open question to investigate. Our project page is available at https://sail-sg.github.io/Agent-Smith/.
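
The abstract's claim of exponentially fast spread under randomized pair-wise chat can be illustrated with a toy simulation. The sketch below is not the authors' experiment: it involves no LLaVA-1.5 agents or adversarial images, and it simply assumes that when an infected agent chats with an uninfected one, the infection is passed on with a fixed probability. The function name `simulate_spread` and the parameter `beta` are hypothetical and chosen only for illustration.

```python
import random


def simulate_spread(num_agents, beta=1.0, max_rounds=100, seed=0):
    """Toy model of infection spreading through randomized pair-wise chat.

    Assumptions (not taken from the paper): transmission is symmetric
    within a pair and succeeds with fixed probability `beta`; infected
    agents never recover.
    Returns the number of infected agents after each round.
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    # The adversary jailbreaks exactly one randomly chosen agent.
    infected[rng.randrange(num_agents)] = True
    history = [1]

    order = list(range(num_agents))
    for _ in range(max_rounds):
        # Randomly partition the agents into disjoint chat pairs.
        rng.shuffle(order)
        newly_infected = []
        for i in range(0, num_agents - 1, 2):
            a, b = order[i], order[i + 1]
            # Only a mixed pair (one infected, one not) can transmit.
            if infected[a] != infected[b] and rng.random() < beta:
                newly_infected.append(b if infected[a] else a)
        for agent in newly_infected:
            infected[agent] = True

        history.append(sum(infected))
        if history[-1] == num_agents:
            break
    return history


if __name__ == "__main__":
    counts = simulate_spread(num_agents=4096, beta=1.0)
    for round_index, count in enumerate(counts):
        print(f"round {round_index}: {count} infected")
```

Because each infected agent chats with exactly one partner per round, the infected count can at most double per round in this model, so the early rounds show roughly geometric growth before saturating as uninfected partners become scarce. Lowering `beta` slows the spread but does not change this overall shape.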

arXiv information

Authors	Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin
Published	2024-02-13 16:06:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
