Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning

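Abstract

Advances in large language models (LLMs) have greatly expanded the range of applications in natural language processing, and multimodal LLMs extend these capabilities to integrate and interpret visual data.
However, existing benchmarks for vision-language models (VLMs) focus mainly on single-image inputs and neglect the important aspect of multi-image understanding.
In this paper, we introduce MIRB, a Multi-Image Relational Benchmark designed to evaluate the ability of VLMs to compare, analyze, and reason across multiple images.
Our benchmark covers four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning.
Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that although open-source VLMs have been shown to approach the performance of GPT-4V on single-image tasks, a significant performance gap remains on multi-image reasoning tasks.
Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area.
We believe that MIRB can serve as a testbed for developing next-generation multimodal models.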

Abstract (original)

The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs’ ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.

arXiv information

Authors	Bingchen Zhao, Yongshuo Zong, Letian Zhang, Timothy Hospedales
Published	2024-06-18 16:02:18+00:00
arXiv page	arxiv_id(pdf)

