MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams

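Summary

Diagrams serve as a fundamental form of visual language, representing complex concepts and their interrelationships through structured symbols, shapes, and spatial arrangements.
Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs).
However, current benchmarks conflate perception and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition.
To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs.
MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks (shape classification, object counting, relationship identification, and object grounding) and covering diverse domains including plane geometry, solid geometry, and graphical representations.
Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks.
In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships.
Training MLLMs on GeoPeP yields significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning.
Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, and provide valuable resources and insights to foster future MLLM research.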

Abstract (original)

Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Unlike natural images, their inherently symbolic and abstract nature poses significant challenges for Multimodal Large Language Models (MLLMs). However, current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether MLLMs genuinely understand mathematical diagrams beyond superficial pattern recognition. To address this gap, we introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. MATHGLANCE comprises 1.2K images and 1.6K carefully curated questions spanning four perception tasks: shape classification, object counting, relationship identification, and object grounding, covering diverse domains including plane geometry, solid geometry, and graphical representations. Our evaluation of MLLMs reveals that their ability to understand diagrams is notably limited, particularly in fine-grained grounding tasks. In response, we construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs explicitly annotated with geometric primitives and precise spatial relationships. Training MLLM on GeoPeP leads to significant gains in perceptual accuracy, which in turn substantially improves mathematical reasoning. Our benchmark and dataset establish critical standards for evaluating and advancing multimodal mathematical understanding, providing valuable resources and insights to foster future MLLM research.

arXiv information

Authors	Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel
Published	2025-03-26 17:30:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
