Towards Harmless Multimodal Assistants with Blind Preference Optimization

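Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction.
Given the broad range of applications of MLLMs, the associated safety issues have become increasingly important.
Because preference optimization is effective at aligning MLLMs with human preferences, safety-related preference data for MLLMs is urgently needed.
To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, a conversational format, and ranked paired responses from human feedback.
We also identify two insightful observations, modality co-defense and modality cheating, which show that MLLMs have a certain level of inherent defense while still presenting unique safety challenges.
Based on these observations, we propose the Blind Preference Optimization (BPO) approach.
Comprehensive experiments on three benchmarks show that BPO effectively strengthens the safety capabilities of MLLMs.
Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach.
In addition, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM's unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach.
Code and data are released at https://lu-yang666.github.io/MMsafe-PO-Web/.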

Abstract (original)

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in multimodal understanding, reasoning, and interaction. Given the extensive applications of MLLMs, the associated safety issues have become increasingly critical. Due to the effectiveness of preference optimization in aligning MLLMs with human preferences, there is an urgent need for safety-related preference data for MLLMs. To address this, we construct the MMSafe-PO preference dataset towards harmless multimodal assistants, featuring multimodal instructions, the conversational format, and ranked paired responses from human feedback. We also identify two insightful observations: modality co-defense and modality cheating, which illustrate that MLLMs possess a certain level of inherent defense while still presenting unique safety challenges. Based on these observations, we propose the Blind Preference Optimization (BPO) approach. Comprehensive experiments on three benchmarks show that BPO effectively enhances the safety capabilities of MLLMs. Notably, BPO significantly improves the safety rate of the base MLLM by 45.0%, outperforming the DPO approach. Additionally, applying BPO to the MMSafe-PO dataset greatly reduces the base MLLM’s unsafe rate on other safety benchmarks (14.5% on MM-SafetyBench and 82.9% on HarmEval), demonstrating the effectiveness and robustness of both the dataset and the approach. We release code and data at https://lu-yang666.github.io/MMsafe-PO-Web/.
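
The abstract compares BPO against DPO without stating either objective. For orientation only, the following is a minimal sketch of the standard DPO loss for a single ranked response pair, which is the baseline being compared against. It is not the paper's BPO objective, which the abstract does not describe. The function name, argument names, and the default `beta` are illustrative assumptions, and the inputs are assumed to be summed log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All names here are illustrative assumptions, not taken from the paper.
    Each argument is the summed log-probability of a response given the
    (multimodal) prompt, under the policy or the frozen reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled difference of margins; positive when the policy prefers "chosen".
    x = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(x)), computed in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy equals the reference model, both margins are zero and the loss is log 2. The loss falls as the policy shifts probability toward the preferred (here, safer) response relative to the reference.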

arXiv information

Authors	Yongqi Li, Lu Yang, Jian Wang, Runyang You, Wenjie Li, Liqiang Nie
Published	2025-03-18 12:02:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
