II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

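Abstract

Rapid progress in multimodal large language models (MLLMs) has repeatedly produced new breakthroughs on a variety of benchmarks.
In response, many challenging and comprehensive benchmarks have been proposed to assess MLLM capabilities more accurately.
However, the higher-order perceptual capabilities of MLLMs remain underexplored.
To fill this gap, we propose II-Bench, an Image Implication understanding Benchmark that aims to evaluate a model's higher-order perception of images.
Extensive experiments on II-Bench across multiple MLLMs yield several significant findings.
First, there is a substantial gap between MLLM and human performance on II-Bench.
The best MLLM accuracy reaches 74.8%, whereas human accuracy averages 90% and peaks at 98%.
Second, MLLMs perform worse on abstract and complex images, which suggests limits in their ability to understand high-level semantics and capture image details.
Finally, most models become more accurate when hints about image sentiment polarity are included in the prompt.
This observation highlights a notable deficiency in their inherent understanding of image sentiment.
We believe that II-Bench will inspire the community to develop the next generation of MLLMs and advance progress toward expert artificial general intelligence (AGI).
II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.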

Abstract (original)

The rapid advancements in the development of multimodal large language models (MLLMs) have consistently led to new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to more accurately assess the capabilities of MLLMs. However, there is a dearth of exploration of the higher-order perceptual capabilities of MLLMs. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate the model’s higher-order perception of images. Through extensive experiments on II-Bench across multiple MLLMs, we have made significant findings. Initially, a substantial gap is observed between the performance of MLLMs and humans on II-Bench. The pinnacle accuracy of MLLMs attains 74.8%, whereas human accuracy averages 90%, peaking at an impressive 98%. Subsequently, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, it is observed that most models exhibit enhanced accuracy when image sentiment polarity hints are incorporated into the prompts. This observation underscores a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.

arXiv information

Authors	Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni
Published	2025-01-13 09:33:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
