Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

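Summary

Building on the strong capabilities of large language models (LLMs), a pre-trained visual encoder connected to an LLM forms a vision-language model (VLM).
However, recent research shows that the visual modality of VLMs is highly vulnerable: attackers can bypass the LLM's safety alignment through visually transmitted content and launch harmful attacks.
To address this challenge, we propose PSA-VLM, a progressive concept-based alignment strategy that incorporates safety modules as concept bottlenecks to strengthen safety alignment of the visual modality.
Aligning model predictions with specific safety concepts improves defenses against risky images and enhances explainability and controllability, with minimal impact on general performance.
Our method is obtained through two-stage training.
The first stage delivers a highly effective performance improvement at low computational cost, and fine-tuning the language model in the second stage further improves safety performance.
Our method achieves state-of-the-art results on popular VLM safety benchmarks.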

Abstract (original)

Benefiting from the powerful capabilities of Large Language Models (LLMs), pre-trained visual encoder models connected to LLMs form Vision Language Models (VLMs). However, recent research shows that the visual modality in VLMs is highly vulnerable, allowing attackers to bypass safety alignment in LLMs through visually transmitted content, launching harmful attacks. To address this challenge, we propose a progressive concept-based alignment strategy, PSA-VLM, which incorporates safety modules as concept bottlenecks to enhance visual modality safety alignment. By aligning model predictions with specific safety concepts, we improve defenses against risky images, enhancing explainability and controllability while minimally impacting general performance. Our method is obtained through two-stage training. The low computational cost of the first stage brings very effective performance improvement, and the fine-tuning of the language model in the second stage further improves the safety performance. Our method achieves state-of-the-art results on popular VLM safety benchmark.
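
The abstract describes safety modules that act as concept bottlenecks but gives no implementation detail. As a rough illustration of the general concept-bottleneck idea only (not the authors' PSA-VLM implementation), the sketch below maps image features to a few named safety-concept scores and derives the safety decision solely from those scores, which is what makes the decision inspectable. The concept names, the weights, the threshold, and the `SafetyBottleneck` class are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical safety concepts; the source does not list the concepts used.
CONCEPTS = ("violence", "explicit", "weapons")


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


@dataclass
class SafetyBottleneck:
    """Toy concept bottleneck: features -> concept scores -> decision."""
    weights: List[List[float]]  # one weight row per concept (assumed values)
    biases: List[float]         # one bias per concept (assumed values)
    threshold: float = 0.5      # assumed flagging threshold

    def concept_scores(self, features: Sequence[float]) -> Dict[str, float]:
        # Each concept gets an independent score in (0, 1).
        scores = {}
        for name, row, bias in zip(CONCEPTS, self.weights, self.biases):
            logit = sum(w * f for w, f in zip(row, features)) + bias
            scores[name] = sigmoid(logit)
        return scores

    def decide(self, features: Sequence[float]) -> dict:
        # The decision depends only on the concept scores (the bottleneck),
        # so every refusal can be traced back to named concepts.
        scores = self.concept_scores(features)
        flagged = [name for name, s in scores.items() if s >= self.threshold]
        return {"safe": not flagged, "flagged": flagged, "scores": scores}


# Usage with made-up two-dimensional "image features".
head = SafetyBottleneck(
    weights=[[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]],
    biases=[-2.0, -2.0, -1.0],
)
result = head.decide([1.0, 0.0])
print(result["safe"], result["flagged"])
```

In a real system the weights would be learned rather than hand-set; the abstract's two-stage training (a low-cost first stage, then language-model fine-tuning) is not modeled here.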

arXiv information

Authors	Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, Bo Zheng
Published	2024-11-18 13:01:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
