Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

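Summary

Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved both fine-grained perception of single images and general comprehension across multiple images.
However, existing MLLMs still struggle to achieve precise grounding in complex multi-image scenarios.
To address this, the authors first explore a Chain-of-Thought (CoT) framework that combines single-image grounding with multi-image comprehension.
Although partially effective, this approach remains unstable and, because it is not end-to-end, has difficulty capturing abstract visual information.
They therefore introduce Migician, the first multi-image grounding model capable of free-form, accurate grounding across multiple images.
To support it, they present the MGrounding-630k dataset, which consists of data for several multi-image grounding tasks derived from existing datasets, together with newly generated free-form grounding instruction-following data.
They also propose MIG-Bench, a comprehensive benchmark designed specifically to evaluate multi-image grounding capabilities.
Experimental results show that the model achieves significantly better multi-image grounding, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models.
The code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.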

Abstract (original)

The recent advancement of Multimodal Large Language Models (MLLMs) has significantly improved their fine-grained perception of single images and general comprehension across multiple images. However, existing MLLMs still face challenges in achieving precise grounding in complex multi-image scenarios. To address this, we first explore a Chain-of-Thought (CoT) framework that integrates single-image grounding with multi-image comprehension. While partially effective, it remains unstable and struggles to capture abstract visual information due to its non-end-to-end nature. Therefore, we introduce Migician, the first multi-image grounding model capable of performing free-form and accurate grounding across multiple images. To support this, we present the MGrounding-630k dataset, which comprises data for several multi-image grounding tasks derived from existing datasets, along with newly generated free-form grounding instruction-following data. Furthermore, we propose MIG-Bench, a comprehensive benchmark specifically designed for evaluating multi-image grounding capabilities. Experimental results demonstrate that our model achieves significantly superior multi-image grounding capabilities, outperforming the best existing MLLMs by 21.61% and even surpassing much larger 70B models. Our code, model, dataset, and benchmark are fully open-sourced at https://migician-vg.github.io/.

arXiv information

Authors	You Li, Heyu Huang, Chi Chen, Kaiyu Huang, Chao Huang, Zonghao Guo, Zhiyuan Liu, Jinan Xu, Yuhua Li, Ruixuan Li, Maosong Sun
Published	2025-01-13 10:38:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
