MM-IFEngine: Towards Multimodal Instruction Following

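Abstract

The instruction following (IF) ability measures how well multimodal large language models (MLLMs) understand exactly what users are telling them and whether they carry it out correctly.
Existing multimodal instruction-following training data is scarce, the benchmarks are simple and consist of atomic instructions, and the evaluation strategies are imprecise for tasks that demand exact output constraints.
To address this, we present MM-IFEngine, an effective pipeline for generating high-quality image-instruction pairs.
The MM-IFEngine pipeline yields large-scale, diverse, high-quality training data, MM-IFInstruct-23k, which is suitable for supervised fine-tuning (SFT) and is extended as MM-IFDPO-23k for direct preference optimization (DPO).
We further introduce MM-IFEval, a challenging and diverse multimodal instruction-following benchmark that includes (1) both compose-level constraints on output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline that incorporates both rule-based assessment and a judge model.
We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$).
The full data and evaluation code will be released at https://github.com/SYuan03/MM-IFEngine.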

Abstract (original)

The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and whether they are doing it right. Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints. To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs. Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both compose-level constraints for output responses and perception-level constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating both rule-based assessment and judge model. We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieves notable gains on various IF benchmarks, such as MM-IFEval (+10.2$\%$), MIA (+7.6$\%$), and IFEval (+12.3$\%$). The full data and evaluation code will be released on https://github.com/SYuan03/MM-IFEngine.
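The abstract mentions rule-based assessment of exact output constraints but does not describe how it is implemented. As a rough illustration only, the sketch below shows what simple rule-based checks could look like. The constraint types (`max_words`, `contains_keyword`, `bullet_count`) and the scoring function `score_response` are hypothetical names chosen for this example and are not taken from the MM-IFEval code; the judge-model part of the pipeline is not shown.

```python
import re
from typing import Callable, Dict

# Hypothetical constraint checkers; each returns a function text -> bool.
# These are illustrative assumptions, not the benchmark's actual rules.

def max_words(limit: int) -> Callable[[str], bool]:
    """Pass if the response has at most `limit` whitespace-separated tokens."""
    return lambda text: len(text.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    """Pass if the keyword appears in the response (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def bullet_count(n: int) -> Callable[[str], bool]:
    """Pass if the response has exactly `n` lines starting with '-' or '*'."""
    pattern = re.compile(r"^\s*[-*]\s+", re.MULTILINE)
    return lambda text: len(pattern.findall(text)) == n

def score_response(text: str, checks: Dict[str, Callable[[str], bool]]) -> float:
    """Return the fraction of constraints the response satisfies."""
    if not checks:
        return 0.0
    results = [check(text) for check in checks.values()]
    return sum(results) / len(results)

# Example constraints attached to one (hypothetical) instruction.
checks = {
    "at_most_10_words": max_words(10),
    "mentions_bus": contains_keyword("bus"),
    "two_bullets": bullet_count(2),
}

good = "- A red bus\n- A wet street"
bad = "A red bus"
good_score = score_response(good, checks)
bad_score = score_response(bad, checks)
```

Here `good` satisfies all three checks, while `bad` fails only the bullet-count check. Constraints that cannot be verified by string rules, such as those tied to image content, are the kind a judge model would presumably handle instead.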

arXiv information

Authors	Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang
Published	2025-04-10 17:59:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
