MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

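Summary

Answering questions with Chain-of-Thought (CoT) has significantly improved the reasoning capabilities of large language models (LLMs), but its effect on large multimodal models (LMMs) still lacks systematic evaluation and in-depth investigation.
This paper introduces MME-CoT, a specialized benchmark that evaluates the CoT reasoning performance of LMMs across six domains: math, science, OCR, logic, space-time, and general scenes.
As the first comprehensive study in this area, we propose a thorough evaluation suite with three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level.
Using curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs and uncover several key insights:
1) Models with a reflection mechanism show superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results.
2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting potentially harmful overthinking.
3) Although their CoT quality is high, LMMs with reflection are significantly inefficient in both the normal response and self-correction phases.
We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.
Project page: https://mmecot.github.io/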

Abstract (original)

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess the reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) Models with reflection mechanism demonstrate a superior CoT quality, with Kimi k1.5 outperforming GPT-4o and demonstrating the highest quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) Although the CoT quality is high, LMMs with reflection exhibit significant inefficiency in both normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs. Project Page: https://mmecot.github.io/

arXiv information

Authors	Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, Bo Zhang, Chaoyou Fu, Peng Gao, Hongsheng Li
Published	2025-02-13 18:59:46+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
