Do Multimodal Large Language Models See Like Humans?

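Summary

Multimodal large language models (MLLMs) have achieved impressive results on a variety of vision tasks by leveraging recent advances in large language models.
However, a critical question remains unanswered: do MLLMs perceive visual information the way humans do?
Current benchmarks are unable to evaluate MLLMs from this perspective.
To address this gap, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision.
HVSBench contains over 85K curated multimodal samples spanning 13 categories and 5 fields of the HVS: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching.
Extensive experiments demonstrate that the benchmark provides a comprehensive evaluation of MLLMs.
Specifically, we evaluate 13 MLLMs and find that even the best models leave significant room for improvement, with most achieving only moderate results.
Our experiments show that HVSBench poses a new and significant challenge for state-of-the-art MLLMs.
Diverse human participants achieved strong performance, significantly outperforming the MLLMs, which further underscores the benchmark's high quality.
We believe HVSBench will facilitate research on human-aligned and explainable MLLMs, and that it marks a key step toward understanding how MLLMs perceive and process visual information.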

Abstract (original)

Multimodal Large Language Models (MLLMs) have achieved impressive results on various vision tasks, leveraging recent advancements in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks lack the ability to evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench curated over 85K multimodal samples, spanning 13 categories and 5 fields in HVS, including Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate the effectiveness of our benchmark in providing a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs, revealing that even the best models show significant room for improvement, with most achieving only moderate results. Our experiments reveal that HVSBench presents a new and significant challenge for cutting-edge MLLMs. Diverse human participants attained strong performance, significantly outperforming MLLMs, which further underscores the benchmark’s high quality. We believe that HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step in understanding how MLLMs perceive and process visual information.

arXiv information

Authors	Jiaying Lin, Shuquan Ye, Rynson W. H. Lau
Published	2025-03-27 17:59:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
