Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

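Summary

High-performance multimodal large language models (MLLMs) depend heavily on data quality.
This study introduces a new dataset named Img-Diff, designed to strengthen fine-grained image recognition in MLLMs by drawing on insights from contrastive learning and image difference captioning.
By analyzing object differences between similar images, we challenge models to identify both matching and differing components.
We use the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements.
Our methodology includes a Difference Area Generator that identifies object differences, followed by a Difference Captions Generator that describes those differences in detail.
The result is a relatively small but high-quality dataset of "object replacement" samples.
We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements in performance scores over SOTA models trained on larger-scale datasets across numerous image difference and Visual Question Answering tasks.
For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark.
In addition, we investigate alternative methods for generating image difference data through "object removal", conduct a thorough evaluation to confirm the dataset's diversity, quality, and robustness, and present several insights on synthesizing such contrastive datasets.
To encourage further research, advance the field of multimodal data synthesis, and strengthen the fundamental image-understanding capabilities of MLLMs, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.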

Summary (original)

High-performance Multimodal Large Language Models (MLLMs) rely heavily on data quality. This study introduces a novel dataset named Img-Diff, designed to enhance fine-grained image recognition in MLLMs by leveraging insights from contrastive learning and image difference captioning. By analyzing object differences between similar images, we challenge models to identify both matching and distinct components. We utilize the Stable-Diffusion-XL model and advanced image editing techniques to create pairs of similar images that highlight object replacements. Our methodology includes a Difference Area Generator for identifying object differences, followed by a Difference Captions Generator for detailed difference descriptions. The result is a relatively small but high-quality dataset of ‘object replacement’ samples. We use the proposed dataset to fine-tune state-of-the-art (SOTA) MLLMs such as MGM-7B, yielding comprehensive improvements of performance scores over SOTA models that were trained with larger-scale datasets, in numerous image difference and Visual Question Answering tasks. For instance, our trained models notably surpass the SOTA models GPT-4V and Gemini on the MMVP benchmark. In addition, we investigate alternative methods for generating image difference data through ‘object removal’ and conduct a thorough evaluation to confirm the dataset’s diversity, quality, and robustness, presenting several insights on the synthesis of such contrastive datasets. To encourage further research and advance the field of multimodal data synthesis and enhancement of MLLMs’ fundamental capabilities for image understanding, we release our code and dataset at https://github.com/modelscope/data-juicer/tree/ImgDiff.

arXiv information

Authors	Qirui Jiao, Daoyuan Chen, Yilun Huang, Yaliang Li, Ying Shen
Published	2024-08-08 17:10:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
