MM-IQ: Benchmarking Human-Like Abstraction and Reasoning in Multimodal Models

Abstract

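IQ testing has long served as a foundational method for evaluating human cognitive ability. It deliberately decouples assessment from linguistic background, language proficiency, and domain-specific knowledge in order to isolate core competencies in abstraction and reasoning.
Artificial intelligence research, however, currently lacks a systematic benchmark for quantifying these cognitive capabilities in multimodal systems.
To address this gap, we propose MM-IQ, a comprehensive evaluation framework comprising a large-scale training set of 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms.
A systematic evaluation of existing open-source and proprietary multimodal models reveals striking limitations: even state-of-the-art architectures perform only marginally better than random chance (33.17% accuracy vs. a 25% baseline).
This substantial gap highlights how poorly current multimodal models approximate fundamental human reasoning capacities, and underscores the need for paradigm-shifting advances to close it.
Inspired by the recent surge of large reasoning models, we also release a multimodal reasoning model as a baseline. It is trained via reinforcement learning with verifiable reward functions and reaches performance competitive with the state of the art at a notably smaller model size.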

Abstract (original)

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive capabilities in multimodal systems. To address this crucial gap, we propose MM-IQ, a comprehensive evaluation framework, which comprises a large-scale training set with 4,776 visual reasoning problems and 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms. Through systematic evaluation of existing open-source and proprietary multimodal models, our benchmark reveals striking limitations: even state-of-the-art architectures achieve only marginally superior performance to random chance (33.17% vs. 25% baseline accuracy). This substantial performance chasm highlights the inadequacy of current multimodal models in approximating fundamental human reasoning capacities, underscoring the need for paradigm-shifting advancements to bridge this cognitive divide. Moreover, inspired by the recent surge of large reasoning models, we also release a multimodal reasoning model as the baseline that is trained via reinforcement learning with verifiable reward functions, reaching competitive performance to the state-of-the-art with a notably smaller model size.
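The abstract mentions training with "verifiable reward functions" and reports accuracy against a 25% chance baseline, but it does not describe how answers are checked. The following is a minimal sketch of what such a reward could look like for multiple-choice items. It assumes four options labelled A to D (consistent with the 25% baseline, though not stated in the abstract) and a hypothetical "Answer: <letter>" output format. The names `extract_choice`, `verifiable_reward`, and `accuracy` are illustrative only and are not taken from the paper.

```python
import re
from typing import Optional, Sequence

# Assumed output convention (hypothetical): the model states its final
# choice as "Answer: <letter>", with letters A-D.
ANSWER_PATTERN = re.compile(r"Answer:\s*([A-D])\b", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last stated choice in upper case, or None if absent."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted choice equals the gold label."""
    choice = extract_choice(response)
    return 1.0 if choice is not None and choice == gold.upper() else 0.0


def accuracy(responses: Sequence[str], golds: Sequence[str]) -> float:
    """Mean reward over a set of items; 0.0 for an empty set."""
    if not responses:
        return 0.0
    total = sum(verifiable_reward(r, g) for r, g in zip(responses, golds))
    return total / len(responses)
```

Because the reward depends only on a string comparison with the gold label, it can be computed automatically without a learned judge, which is the property "verifiable" usually refers to. Under the four-option assumption, a model guessing uniformly at random would score 0.25 in expectation on `accuracy`.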

arXiv information

Authors	Huanqia Cai, Yijun Yang, Winston Hu
Published	2025-06-04 16:20:49+00:00
