Do Multimodal Large Language Models See Like Humans?

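Summary

Multimodal large language models (MLLMs) have achieved impressive results on a variety of vision tasks by leveraging recent advances in large language models.
However, an important question remains open: do MLLMs perceive visual information the way humans do?
Current benchmarks are not able to evaluate MLLMs from this perspective.
To address this, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.
HVSBench contains over 85,000 curated multimodal samples spanning 13 categories and 5 fields of the HVS: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching.
Extensive experiments demonstrate that the benchmark provides a comprehensive evaluation of MLLMs.
Specifically, we evaluate 13 MLLMs and find that even the best models leave significant room for improvement, with most achieving only moderate results.
Our experiments show that HVSBench poses a new and significant challenge for state-of-the-art MLLMs.
We believe HVSBench will facilitate research on human-aligned and explainable MLLMs, and that it marks a key step toward understanding how MLLMs perceive and process visual information.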

Abstract (original)

Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.

arXiv information

Authors	Jiaying Lin, Shuquan Ye, Rynson W. H. Lau
Published	2024-12-12 18:59:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
