Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

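Summary

Multimodal foundation models such as OpenFlamingo, LLaVA, and GPT-4 are increasingly used for a wide range of real-world tasks.
Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality.
Because such attacks can be used to spread false information or defraud users, they pose a serious risk, which makes the robustness of large multimodal foundation models a pressing problem.
The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), for example LLaVA and OpenFlamingo.
We propose an unsupervised adversarial fine-tuning scheme for obtaining a robust CLIP vision encoder, which brings robustness to all downstream vision tasks that rely on CLIP (LVLMs, zero-shot classification).
In particular, we show that replacing the original CLIP model with the robust one makes stealth attacks on LVLM users by a malicious third party supplying manipulated images no longer possible.
No retraining or fine-tuning of the downstream LVLM is required.
The code and robust models are available at https://github.com/chs20/RobustVLM.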

Summary (original)

Multi-modal foundation models like OpenFlamingo, LLaVA, and GPT-4 are increasingly used for various real-world tasks. Prior work has shown that these models are highly vulnerable to adversarial attacks on the vision modality. These attacks can be leveraged to spread fake information or defraud users, and thus pose a significant risk, which makes the robustness of large multi-modal foundation models a pressing problem. The CLIP model, or one of its variants, is used as a frozen vision encoder in many large vision-language models (LVLMs), e.g. LLaVA and OpenFlamingo. We propose an unsupervised adversarial fine-tuning scheme to obtain a robust CLIP vision encoder, which yields robustness on all vision down-stream tasks (LVLMs, zero-shot classification) that rely on CLIP. In particular, we show that stealth-attacks on users of LVLMs by a malicious third party providing manipulated images are no longer possible once one replaces the original CLIP model with our robust one. No retraining or fine-tuning of the down-stream LVLMs is required. The code and robust models are available at https://github.com/chs20/RobustVLM
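The abstract does not spell out the training objective, so the following is only a rough sketch of what an unsupervised adversarial fine-tuning step could look like. It assumes, purely for illustration, that the fine-tuned encoder is trained to keep the embedding of a perturbed input close to the embedding the frozen original encoder assigns to the clean input, with the perturbation found by a projected-gradient search inside an l_inf ball. The toy linear encoders `W_org` and `W_ft`, the helper names, and the values of `eps`, `lr`, and the step size are all hypothetical and are not taken from the paper or its repository.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def residual(W_ft, W_org, z, x):
    """phi_ft(z) - phi_org(x) for toy linear encoders phi(v) = W v."""
    return [a - b for a, b in zip(matvec(W_ft, z), matvec(W_org, x))]

def embedding_loss(W_ft, W_org, z, x):
    """Squared l2 distance between the perturbed and the clean embedding."""
    return sum(r * r for r in residual(W_ft, W_org, z, x))

def pgd_attack(W_ft, W_org, x, eps, steps, rng):
    """Search the l_inf ball of radius eps around x for a high-loss input."""
    z = [xi + rng.uniform(-eps, eps) for xi in x]  # random start
    step_size = eps / 4  # hypothetical step-size choice
    for _ in range(steps):
        r = residual(W_ft, W_org, z, x)
        # Gradient of the loss w.r.t. z is 2 * W_ft^T r.
        grad = [2 * sum(W_ft[i][j] * r[i] for i in range(len(r)))
                for j in range(len(z))]
        # Signed ascent step; (g > 0) - (g < 0) is the sign of g.
        z = [zj + step_size * ((g > 0) - (g < 0)) for zj, g in zip(z, grad)]
        # Project back onto the eps-ball around the clean input.
        z = [min(max(zj, xj - eps), xj + eps) for zj, xj in zip(z, x)]
    return z

def finetune_step(W_ft, W_org, x, eps, lr, rng, attack_steps=10):
    """One label-free update of W_ft (in place); returns the adversarial input."""
    z = pgd_attack(W_ft, W_org, x, eps, attack_steps, rng)
    r = residual(W_ft, W_org, z, x)
    # Gradient of the loss w.r.t. W_ft[i][j] is 2 * r[i] * z[j].
    for i, row in enumerate(W_ft):
        for j in range(len(row)):
            row[j] -= lr * 2 * r[i] * z[j]
    return z

rng = random.Random(0)
W_org = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]  # frozen original
W_org_snapshot = [row[:] for row in W_org]  # kept only to check W_org stays frozen
W_ft = [row[:] for row in W_org]            # copy that gets fine-tuned
x = [rng.random() for _ in range(4)]        # one unlabeled toy "image"
eps, lr = 0.1, 0.05                         # hypothetical hyperparameters

z = finetune_step(W_ft, W_org, x, eps, lr, rng)
loss_before = embedding_loss(W_org, W_org, z, x)  # W_ft equalled W_org before the step
loss_after = embedding_loss(W_ft, W_org, z, x)
print("loss on the adversarial input before / after one step:", loss_before, loss_after)
```

No labels appear anywhere in the loop: the only supervision signal is the clean embedding from the frozen original encoder. In a real setting the linear maps would be replaced by a CLIP vision encoder and the hand-written gradients by automatic differentiation.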

arXiv information

Authors	Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
Published	2024-06-05 15:32:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
