LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models

Summary

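Multimodal large language models (MLLMs) have made significant progress in integrating visual and linguistic information, yet their ability to reason about complex, real-world scenarios remains limited.
Existing benchmarks are usually constructed in a task-oriented manner, with no guarantee that samples for different tasks come from the same data distribution, so they often fall short in evaluating how lower-level perceptual capabilities contribute to higher-order reasoning.
To address this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, organized into three progressive task tiers: perception, understanding, and reasoning.
One key feature is that each image carries rich annotations for all tasks.
The dataset therefore inherently supports evaluating how MLLMs handle image-invariable prompts, from basic perception to compositional reasoning.
In addition, our images are manually collected from social media, and 53% of them were published after January 2025. We evaluate 15+ frontier MLLMs, such as Qwen2.5-VL-72B, InternVL3-78B, and GPT-4o, along with two reasoning models, QVQ-72B-preview and Kimi-VL.
These models were all released after December 2024, and none of them achieves an accuracy above 60% on the reasoning tasks.
Project page: https://github.com/Lens4MLLMs/lens.
ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/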

Summary (original)

Multimodal Large Language Models (MLLMs) have achieved significant advances in integrating visual and linguistic information, yet their ability to reason about complex and real-world scenarios remains limited. The existing benchmarks are usually constructed in the task-oriented manner without guarantee that different task samples come from the same data distribution, thus they often fall short in evaluating the synergistic effects of lower-level perceptual capabilities on higher-order reasoning. To lift this limitation, we contribute Lens, a multi-level benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios, forming three progressive task tiers, i.e., perception, understanding, and reasoning. One feature is that each image is equipped with rich annotations for all tasks. Thus, this dataset intrinsically supports to evaluate MLLMs to handle image-invariable prompts, from basic perception to compositional reasoning. In addition, our images are manually collected from the social media, in which 53% were published later than Jan. 2025. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models QVQ-72B-preview and Kimi-VL. These models are released later than Dec. 2024, and none of them achieve an accuracy greater than 60% in the reasoning tasks. Project page: https://github.com/Lens4MLLMs/lens. ICCV 2025 workshop page: https://lens4mllms.github.io/mars2-workshop-iccv2025/

arXiv information

Authors	Ruilin Yao, Bo Zhang, Jirui Huang, Xinwei Long, Yifang Zhang, Tianyu Zou, Yufei Wu, Shichao Su, Yifan Xu, Wenxi Zeng, Zhaoyu Yang, Guoyou Li, Shilan Zhang, Zichan Li, Yaxiong Chen, Shengwu Xiong, Peng Xu, Jiajun Zhang, Bowen Zhou, David Clifton, Luc Van Gool
Published	2025-05-21 15:06:59+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
