MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs

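Abstract

Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems.
This is difficult to study because (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture the many ways in which a problem's proof can be arbitrarily complex.
In this paper, we present MathGAP, a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs.
MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications of their arithmetic proof structure, enabling systematic studies of easy-to-hard generalization with respect to the complexity of proof trees.
Using MathGAP, we find that LLM performance drops significantly as proofs get deeper and wider.
This effect is more pronounced for complex, nonlinear proof structures, which are challenging even for the most capable models.
The models are also sensitive to simple changes in sentence ordering.
However, they can still solve some complex problems, suggesting that reasoning generalization is noisy.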

Abstract (original)

Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
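To make the notions of proof depth and width concrete, here is a minimal toy sketch in Python. It is not MathGAP's implementation: the `Node` class, the `linear_problem` helper, the apple-counting template, and the definitions of depth (longest chain of inferences) and width (number of stated facts) are all assumptions chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One step in a toy proof tree: a stated fact (leaf) or an inferred quantity.

    Hypothetical structure for illustration; not taken from the paper.
    """
    text: str
    value: int  # for a comparison leaf this holds the increment, not a total
    children: List["Node"] = field(default_factory=list)


def depth(node: Node) -> int:
    # A leaf has depth 0; an inference is one deeper than its deepest premise.
    return 0 if not node.children else 1 + max(depth(c) for c in node.children)


def width(node: Node) -> int:
    # Width is taken here to be the number of leaves, i.e. facts in the problem text.
    return 1 if not node.children else sum(width(c) for c in node.children)


def linear_problem(names: List[str], start: int, increments: List[int]) -> Node:
    """Build a chain in which each person has `inc` more apples than the previous one."""
    assert len(names) == len(increments) + 1
    current = Node(f"{names[0]} has {start} apples.", start)
    for name, prev, inc in zip(names[1:], names, increments):
        premise = Node(f"{name} has {inc} more apples than {prev}.", inc)
        total = current.value + inc
        # The inferred fact depends on the running total and the new comparison.
        current = Node(f"{name} has {total} apples.", total, [current, premise])
    return current


def statement(node: Node) -> List[str]:
    # Collect leaf sentences left to right; together they form the problem text.
    if not node.children:
        return [node.text]
    return [s for c in node.children for s in statement(c)]


root = linear_problem(["Alice", "Bob", "Carol"], start=3, increments=[2, 4])
print(" ".join(statement(root)))
print("answer:", root.value, "depth:", depth(root), "width:", width(root))
```

Lengthening `increments` yields deeper linear proofs from the same template, which is the kind of controlled complexity scaling the abstract describes. Nonlinear structures would need inference nodes whose premises are themselves inferred, which this sketch does not cover.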

arXiv information

Authors	Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
Published	2025-02-14 18:15:01+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
