MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

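Summary

We introduce MuirBench, a comprehensive benchmark focused on the robust multi-image understanding capabilities of multimodal LLMs.
MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations).
Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is constructed in a pairwise manner: for reliable assessment, each standard instance is paired with an unanswerable variant that differs from it only minimally in semantics.
An evaluation of 20 recent multimodal LLMs shows that even the best-performing models, such as GPT-4o and Gemini Pro, find MuirBench difficult, reaching accuracies of 68.0% and 49.3%, respectively.
Open-source multimodal LLMs trained on single images barely generalize to multi-image questions, with accuracy remaining below 33.3%.
These results underscore the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, and they suggest potential pathways for future improvement.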

Abstract (original)

We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.

arXiv information

Authors	Fei Wang, Xingyu Fu, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen
Published	2024-06-13 17:59:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
