Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

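Summary

Recent advances in multimodal models have demonstrated strong visual perception, reasoning, and vision-language understanding.
However, visual matching ability remains unstudied, even though finding the visual correspondence of objects is essential in vision research.
Our research reveals that the matching capabilities of recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even in a strong current model such as GPT-4o.
In particular, we construct the Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs.
The MMVM benchmark is built from 15 open-source datasets and Internet videos, with manual annotation.
To evaluate and analyze current MLLMs more comprehensively, we categorize the MMVM data samples into eight aspects based on the cues and capabilities they require.
In addition, we design an automatic annotation pipeline to generate the MMVM SFT dataset, which contains 220K visual matching samples with reasoning annotation.
Finally, we present CoLVA, a novel contrastive MLLM with two new technical designs: a fine-grained vision expert trained with object-level contrastive learning, and an instruction augmentation strategy.
CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively.
The results demonstrate the effectiveness of the MMVM SFT dataset and the new technical designs.
Code, benchmark, dataset, and models are available at https://github.com/zhouyiks/CoLVA.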
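
The abstract reports CoLVA's overall accuracy and its margins over GPT-4o and the baseline, but not the two comparison scores themselves. The following is a minimal sketch of how those scores can be recovered, assuming the margins are absolute differences in percentage points (the abstract does not say so explicitly). The names `COLVA_OA`, `implied_score`, and the two margin constants are hypothetical and chosen only for illustration.

```python
# Figures quoted in the abstract, in percent overall accuracy (OA).
COLVA_OA = 51.06
MARGIN_OVER_GPT4O = 8.41
MARGIN_OVER_BASELINE = 23.58


def implied_score(reported: float, margin: float) -> float:
    """Return the comparison score implied by an absolute margin.

    Assumption: `margin` is a difference in percentage points,
    not a relative improvement.
    """
    return round(reported - margin, 2)


gpt4o_oa = implied_score(COLVA_OA, MARGIN_OVER_GPT4O)
baseline_oa = implied_score(COLVA_OA, MARGIN_OVER_BASELINE)

print(f"Implied GPT-4o OA:   {gpt4o_oa:.2f}%")
print(f"Implied baseline OA: {baseline_oa:.2f}%")
```

Under that assumption the implied scores are 42.65% for GPT-4o and 27.48% for the baseline. If the margins were instead relative improvements, the implied scores would differ, so these values should be checked against the paper's results tables.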

Summary (original)

Recent advancements in multimodal models have shown a strong ability in visual perception, reasoning abilities, and vision-language understanding. However, studies on visual matching ability are missing, where finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even with current strong MLLMs models, GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: fine-grained vision expert with object-level contrastive learning and instruction augmentation strategy. CoLVA achieves 51.06\% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and baseline by 8.41\% and 23.58\% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at https://github.com/zhouyiks/CoLVA.

arXiv information

Authors	Yikang Zhou,Tao Zhang,Shilin Xu,Shihao Chen,Qianyu Zhou,Yunhai Tong,Shunping Ji,Jiangning Zhang,Xiangtai Li,Lu Qi
Published	2025-01-31 16:12:22+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
