VERIFY: A Benchmark of Visual Explanation and Reasoning for Investigating Multimodal Reasoning Fidelity

Summary

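Visual reasoning is central to human cognition, allowing individuals to interpret their environment and understand it abstractly.
Recent multimodal large language models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, but existing benchmarks mainly measure recognition-based skills and do not adequately assess true visual reasoning ability.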
To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs.
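VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic bias.
Each problem comes with a human-annotated reasoning path, making VERIFY the first benchmark to provide an in-depth evaluation of model decision-making processes.
In addition, we propose new metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in the reasoning patterns of current models.
Comprehensive benchmarking of leading MLLMs reveals significant limitations, underscoring the need for a balanced, holistic approach to both perception and reasoning.
For more teasers and test details, see the project page (https://verify-eqh.pages.dev/).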

Summary (original)

Visual reasoning is central to human cognition, enabling individuals to interpret and abstractly understand their environment. Although recent Multimodal Large Language Models (MLLMs) have demonstrated impressive performance across language and vision-language tasks, existing benchmarks primarily measure recognition-based skills and inadequately assess true visual reasoning capabilities. To bridge this critical gap, we introduce VERIFY, a benchmark explicitly designed to isolate and rigorously evaluate the visual reasoning capabilities of state-of-the-art MLLMs. VERIFY compels models to reason primarily from visual information, providing minimal textual context to reduce reliance on domain-specific knowledge and linguistic biases. Each problem is accompanied by a human-annotated reasoning path, making it the first to provide in-depth evaluation of model decision-making processes. Additionally, we propose novel metrics that assess visual reasoning fidelity beyond mere accuracy, highlighting critical imbalances in current model reasoning patterns. Our comprehensive benchmarking of leading MLLMs uncovers significant limitations, underscoring the need for a balanced and holistic approach to both perception and reasoning. For more teaser and testing, visit our project page (https://verify-eqh.pages.dev/).

arXiv information

Authors	Jing Bi, Junjia Guo, Susan Liang, Guangyu Sun, Luchuan Song, Yunlong Tang, Jinxi He, Jiarui Wu, Ali Vosoughi, Chen Chen, Chenliang Xu
Published	2025-03-14 16:26:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
