FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

Abstract

We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics — from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.

arXiv information

Authors	Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, Mark Wildon
Published	2024-11-14 16:26:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
