DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

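Summary

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative to develop an open mathematical LLM and to evaluate its mathematical creativity systematically.
This paper is the first contribution of that initiative.
Recent work on mathematical LLMs has mainly emphasized reasoning skills, as shown by benchmarks covering elementary through undergraduate-level tasks, while the creative capabilities of these models have received comparatively little attention and evaluation datasets remain scarce.
To address this gap, the paper proposes evaluation criteria for mathematical creativity and introduces DeepMath-Creative, a new, high-quality benchmark of constructive problems spanning algebra, geometry, analysis, and other domains.
Using this dataset, the authors systematically evaluate the creative problem-solving abilities of mainstream LLMs.
The experiments show that even under lenient scoring criteria, which emphasize core solution components and disregard minor inaccuracies such as small logical gaps, incomplete justifications, or redundant explanations, the best-performing model, O3 Mini, achieves only 70% accuracy, and mainly on basic undergraduate-level constructive tasks.
Performance drops sharply on more complex problems, and the models fail to offer substantive strategies for open problems.
These findings suggest that although current LLMs show some constructive proficiency on familiar, lower-difficulty problems, that performance is likely attributable to the recombination of memorized patterns rather than genuine creative insight or novel synthesis.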

Summary (original)

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper represents the initial contribution of this initiative. While recent developments in mathematical LLMs have predominantly emphasized reasoning skills, as evidenced by benchmarks on elementary to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose an evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. We conduct a systematic evaluation of mainstream LLMs’ creative problem-solving abilities using this dataset. Experimental results show that even under lenient scoring criteria — emphasizing core solution components and disregarding minor inaccuracies, such as small logical gaps, incomplete justifications, or redundant explanations — the best-performing model, O3 Mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.

arXiv information

Authors	Xiaoyang Chen, Xinan Dai, Yu Du, Qian Feng, Naixu Guo, Tingshuo Gu, Yuting Gao, Yingyi Gao, Xudong Han, Xiang Jiang, Yilin Jin, Hongyi Lin, Shisheng Lin, Xiangnan Li, Yuante Li, Yixing Li, Zhentao Lai, Zilu Ma, Yingrong Peng, Jiacheng Qian, Hao-Yu Sun, Jianbo Sun, Zirui Wang, Siwei Wu, Zian Wang, Bin Xu, Jianghao Xu, Yiyang Yu, Zichuan Yang, Hongji Zha, Ruichong Zhang
Published	2025-05-13 16:58:05+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
