Self-interpreting Adversarial Images

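Summary

We introduce a new type of indirect, cross-modal injection attack against visual language models that enables the creation of self-interpreting images.
These images contain hidden "meta-instructions" that control how the model answers users' questions about the image, steering its output to express an adversary-chosen style, sentiment, or point of view.
Self-interpreting images act as soft prompts: they condition the model to satisfy the adversary's (meta-)objective while still generating answers based on the image's visual content.
Meta-instructions are therefore a stronger form of prompt injection.
The adversarial images look natural and the model's answers are coherent and plausible, yet the answers also follow the adversary-chosen interpretation, such as political spin, or even objectives that cannot be achieved with explicit text instructions.
We evaluate the efficacy of self-interpreting images across a variety of models, interpretations, and user prompts.
We describe how these attacks could cause harm by enabling the creation of self-interpreting content that carries spam, misinformation, or spin.
Finally, we discuss defenses.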

Abstract (original)

We introduce a new type of indirect, cross-modal injection attacks against visual language models that enable creation of self-interpreting images. These images contain hidden ‘meta-instructions’ that control how models answer users’ questions about the image and steer models’ outputs to express an adversary-chosen style, sentiment, or point of view. Self-interpreting images act as soft prompts, conditioning the model to satisfy the adversary’s (meta-)objective while still producing answers based on the image’s visual content. Meta-instructions are thus a stronger form of prompt injection. Adversarial images look natural and the model’s answers are coherent and plausible, yet they also follow the adversary-chosen interpretation, e.g., political spin, or even objectives that are not achievable with explicit text instructions. We evaluate the efficacy of self-interpreting images for a variety of models, interpretations, and user prompts. We describe how these attacks could cause harm by enabling creation of self-interpreting content that carries spam, misinformation, or spin. Finally, we discuss defenses.

arXiv information

Authors	Tingwei Zhang, Collin Zhang, John X. Morris, Eugene Bagdasarian, Vitaly Shmatikov
Published	2025-06-13 16:53:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
