Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving

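Summary

Large language models (LLMs) perform well on many natural language tasks, but they struggle with complex mathematical problem-solving, particularly symbolic reasoning and keeping their output consistent.
This study evaluates 10 LLMs with 7 to 8 billion parameters on 945 competition-level problems from the MATH dataset.
The focus is on each model's ability to generate executable Python code as a step in its reasoning process, which involved over 9,450 code executions.
The study introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation.
It also examines how regenerating output token-by-token affects the refinement of results.
The findings reveal a significant 34.5 percentage-point performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%).
This disparity is especially pronounced in complex areas such as Number Theory.
For llama3.1:8b, token-by-token regeneration slightly improved accuracy (+0.8%) and also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision.
The study also notes a consistent trend across all models: harder problems correlate with lower accuracy.
Despite the use of controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.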
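
The summary describes running model-generated Python code in a controlled environment and allowing up to 10 attempts per problem. The sketch below illustrates that general pattern only; it is not the paper's harness. The names `run_generated_code` and `solve_with_retries`, the `generate` callable, and the 5-second timeout are all hypothetical, and a subprocess with a timeout is a much weaker form of isolation than a real sandbox.

```python
import subprocess
import sys


def run_generated_code(code, timeout_s=5.0):
    """Run code in a separate interpreter; return its stdout, or None on failure.

    Assumption: a child process with a timeout stands in for the
    "controlled execution environment"; this is not a security sandbox.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:
        return None
    return proc.stdout.strip()


def solve_with_retries(generate, problem, max_attempts=10):
    """Ask a (hypothetical) generator for code until one run prints an answer.

    `generate(problem, attempt)` is a placeholder for a model call that
    returns Python source as a string.
    Returns (answer, attempts_used); answer is None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        code = generate(problem, attempt)
        answer = run_generated_code(code)
        if answer:
            return answer, attempt
    return None, max_attempts
```

A problem counts as unsolved in this sketch when `solve_with_retries` returns `None` after the final attempt, which mirrors the "unsolved after 10 attempts" wording in the summary.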

Abstract (original)

Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Despite using controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
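
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted in the abstract. It rests on two assumptions that the abstract does not state: that the baseline is one code execution per model per problem, and that the 3.17% unsolved rate is taken over the 945 problems. The function name `summarize` is illustrative.

```python
# Figures quoted in the abstract.
NUM_PROBLEMS = 945
NUM_MODELS = 10
TOP_SCORE = 83.7      # gpt-4o-mini, percent
LOW_SCORE = 49.2      # open-codestral-mamba:v0.1, percent
UNSOLVED_RATE = 3.17  # percent of problems unsolved after 10 attempts


def summarize():
    """Return (baseline executions, gap in percentage points, unsolved problems)."""
    # Assumption: one execution per (model, problem) pair as the baseline.
    executions = NUM_PROBLEMS * NUM_MODELS
    # The gap is a difference of two percentages, i.e. percentage points.
    gap_points = round(TOP_SCORE - LOW_SCORE, 1)
    # Assumption: the unsolved rate is relative to the 945 problems.
    unsolved = round(NUM_PROBLEMS * UNSOLVED_RATE / 100)
    return executions, gap_points, unsolved


print(summarize())  # (9450, 34.5, 30)
```

Under these assumptions, the 34.5% gap is the difference between the two scores in percentage points, and the unsolved share corresponds to roughly 30 problems.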

arXiv information

Author	Evgenii Evstafev
Published	2025-01-28 17:11:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
