Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach

Summary

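The rapid advancement of artificial intelligence (AI) has given large language models (LLMs) a significant impact across a wide range of domains, including healthcare, engineering, science, education, and mathematical reasoning.
Among these, mathematical reasoning remains a particularly challenging capability, often requiring multi-step logic and abstract generalization.
Prior work has examined LLM performance on reasoning tasks, but comprehensive evaluations spanning both depth and breadth across model families remain limited.
This study presents a systematic evaluation of mathematical reasoning ability across eight leading LLMs, including two recent DeepSeek models, using three independent benchmark datasets.
The analysis reveals several key findings: (1) DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark;
(2) distilled variants such as DeepSeek-1.5B show substantial performance degradation;
(3) Gemini 2.0 Flash achieves the lowest response latency.
Beyond quantitative metrics, the study explores how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance.
These findings offer new insights into the capabilities and limitations of current LLMs in mathematical domains and provide guidance for developing future models better suited to rigorous reasoning demands.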

Summary (original)

With the rapid advancement of Artificial Intelligence (AI), Large Language Models (LLMs) have significantly impacted a wide array of domains, including healthcare, engineering, science, education, and mathematical reasoning. Among these, mathematical reasoning remains a particularly challenging capability, often requiring multi-step logic and abstract generalization. While prior work has explored LLM performance on reasoning tasks, comprehensive evaluations that span both depth and breadth across model families remain limited. In this study, we present a systematic evaluation of mathematical reasoning abilities across eight leading LLMs, including two recent DeepSeek models, using three independent benchmark datasets. Our analyses reveal several key findings: (1) DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark; (2) distilled variants, such as DeepSeek-1.5B, exhibit substantial performance degradation; and (3) Gemini 2.0 Flash achieves the lowest response latency. Beyond quantitative metrics, we explore how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance. These findings provide new insights into the capabilities and limitations of current LLMs in mathematical domains, and offer guidance for the development of future models better aligned with rigorous reasoning demands.

arXiv information

Authors	Afrar Jahin, Arif Hassan Zidan, Wei Zhang, Yu Bao, Tianming Liu
Published	2025-05-19 17:36:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
