Generative Evaluation of Complex Reasoning in Large Language Models

Abstract

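As powerful large language models (LLMs) demonstrate superhuman reasoning capabilities, a critical question arises: are LLMs genuinely reasoning, or are they merely recalling answers from vast, web-scraped training datasets? Published benchmarks inevitably become contaminated once they are incorporated into LLM training sets, which undermines their reliability as faithful evaluations. To address this, we introduce KUMO, a generative evaluation framework designed specifically for evaluating LLM reasoning. KUMO synergistically combines LLMs with symbolic engines to dynamically generate diverse multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates new tasks across open-ended domains, requiring models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains generated by KUMO and benchmarked their reasoning abilities against university students. The results show that many LLMs exceed university-level performance on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level performance on complex reasoning tasks. Furthermore, LLM performance on KUMO tasks correlates strongly with performance on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring tool for evaluating genuine LLM reasoning capabilities.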

Abstract (original)

With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO’s value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

arXiv information

Authors	Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
Published	2025-04-03 17:54:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
