Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs

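Abstract

Recent multimodal models have shown strong visual perception, reasoning, and vision-language understanding.
However, visual matching ability has not been studied, even though finding the visual correspondence of objects is essential in vision research.
Our study shows that the matching capabilities of recent multimodal LLMs (MLLMs) still have systematic shortcomings, even in a strong current model such as GPT-4o.
Specifically, we construct the Multimodal Visual Matching (MMVM) benchmark to fairly evaluate over 30 different MLLMs.
The MMVM benchmark is built from 15 open-source datasets and from Internet videos with manual annotation.
To evaluate and analyze current MLLMs more comprehensively, we categorize the benchmark's data samples into eight aspects based on the cues and capabilities they require.
In addition, we design an automatic annotation pipeline to generate the MMVM SFT dataset, which contains 220K visual matching samples with reasoning annotations.
Finally, we present CoLVA, a novel contrastive MLLM with two new technical designs: a fine-grained vision expert trained with object-level contrastive learning, and an instruction augmentation strategy.
CoLVA achieves 51.06% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and the baseline by 8.41% and 23.58% OA, respectively.
These results show the effectiveness of the MMVM SFT dataset and the new technical designs.
Code, benchmark, dataset, and models are available at https://github.com/zhouyiks/CoLVA.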
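
The reported margins can be turned into implied scores for the comparison models. The short sketch below does that arithmetic under one assumption that the abstract does not state explicitly: that the 8.41% and 23.58% margins are absolute differences in OA percentage points rather than relative improvements. The constant and function names are illustrative only.

```python
# Figures quoted in the abstract (overall accuracy, in percent).
COLVA_OA = 51.06
MARGIN_OVER_GPT4O = 8.41
MARGIN_OVER_BASELINE = 23.58


def implied_oa(reference_oa: float, margin: float) -> float:
    """Return the OA implied for a model that trails `reference_oa` by `margin`.

    Assumption: `margin` is an absolute difference in percentage points.
    """
    return round(reference_oa - margin, 2)


gpt4o_oa = implied_oa(COLVA_OA, MARGIN_OVER_GPT4O)
baseline_oa = implied_oa(COLVA_OA, MARGIN_OVER_BASELINE)

print(f"Implied GPT-4o OA:   {gpt4o_oa:.2f}%")
print(f"Implied baseline OA: {baseline_oa:.2f}%")
```

Under that reading, GPT-4o would sit at roughly 42.65% OA and the baseline at roughly 27.48% OA; if the margins were instead relative, these implied values would not hold.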

Abstract (original)

Recent advancements in multimodal models have shown a strong ability in visual perception, reasoning abilities, and vision-language understanding. However, studies on visual matching ability are missing, where finding the visual correspondence of objects is essential in vision research. Our research reveals that the matching capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings, even with current strong MLLMs models, GPT-4o. In particular, we construct a Multimodal Visual Matching (MMVM) benchmark to fairly benchmark over 30 different MLLMs. The MMVM benchmark is built from 15 open-source datasets and Internet videos with manual annotation. We categorize the data samples of MMVM benchmark into eight aspects based on the required cues and capabilities to more comprehensively evaluate and analyze current MLLMs. In addition, we have designed an automatic annotation pipeline to generate the MMVM SFT dataset, including 220K visual matching data with reasoning annotation. Finally, we present CoLVA, a novel contrastive MLLM with two novel technical designs: fine-grained vision expert with object-level contrastive learning and instruction augmentation strategy. CoLVA achieves 51.06\% overall accuracy (OA) on the MMVM benchmark, surpassing GPT-4o and baseline by 8.41\% and 23.58\% OA, respectively. The results show the effectiveness of our MMVM SFT dataset and our novel technical designs. Code, benchmark, dataset, and models are available at https://github.com/zhouyiks/CoLVA.

arXiv information

Authors	Yikang Zhou, Tao Zhang, Shilin Xu, Shihao Chen, Qianyu Zhou, Yunhai Tong, Shunping Ji, Jiangning Zhang, Xiangtai Li, Lu Qi
Published	2025-01-08 18:30:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
