Generative Evaluation of Complex Reasoning in Large Language Models

Summary

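As powerful large language models (LLMs) demonstrate superhuman reasoning capabilities, a critical question arises: do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets?
Publicly released benchmarks inevitably become contaminated once they are incorporated into subsequent LLM training sets, which undermines their reliability as faithful assessments.
To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs.
KUMO combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty.
Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization.
We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO and benchmarked their reasoning abilities against university students.
Our findings show that many LLMs outperform university-level performance on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level performance on complex reasoning challenges.
Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.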
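
The task properties named in the summary (multi-turn, partially observable, adjustable in size) can be illustrated with a toy generator. This is a minimal sketch and not KUMO's actual pipeline: the names `Task`, `generate_task`, and `solve_by_elimination`, the binary-code construction, and the `n_candidates` and `extra_tests` knobs are all assumptions made for illustration, with task size standing in loosely for difficulty.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    candidates: list   # possible answers shown to the solver
    tests: dict        # test name -> {candidate: outcome if that candidate were true}
    hidden: str        # ground truth, never shown to the solver directly

    def run_test(self, name):
        """One turn: reveal only the outcome of the chosen test."""
        return self.tests[name][self.hidden]


def generate_task(n_candidates, extra_tests=0, seed=0):
    """Build a solvable toy task; both arguments are hypothetical size knobs."""
    rng = random.Random(seed)
    n_bits = (n_candidates - 1).bit_length() + extra_tests
    candidates = [f"c{i}" for i in range(n_candidates)]
    # Unique binary codes guarantee that the full test set separates all candidates.
    codes = rng.sample(range(2 ** n_bits), n_candidates)
    tests = {
        f"t{j}": {c: (code >> j) & 1 for c, code in zip(candidates, codes)}
        for j in range(n_bits)
    }
    return Task(candidates, tests, rng.choice(candidates))


def solve_by_elimination(task):
    """Baseline solver: run tests in order and discard inconsistent candidates."""
    remaining = set(task.candidates)
    turns = 0
    for name, table in task.tests.items():
        if len(remaining) == 1:
            break
        outcome = task.run_test(name)
        turns += 1
        remaining = {c for c in remaining if table[c] == outcome}
    return remaining, turns


if __name__ == "__main__":
    task = generate_task(n_candidates=8, extra_tests=2, seed=42)
    answer, turns = solve_by_elimination(task)
    print(answer, turns)
```

Because every task is drawn fresh from a seed, a solver cannot rely on having seen the answer before; it has to gather evidence turn by turn, which is the contrast with static benchmarks that the summary draws.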

Abstract (original)

With powerful large language models (LLMs) demonstrating superhuman reasoning capabilities, a critical question arises: Do LLMs genuinely reason, or do they merely recall answers from their extensive, web-scraped training datasets? Publicly released benchmarks inevitably become contaminated once incorporated into subsequent LLM training sets, undermining their reliability as faithful assessments. To address this, we introduce KUMO, a generative evaluation framework designed specifically for assessing reasoning in LLMs. KUMO synergistically combines LLMs with symbolic engines to dynamically produce diverse, multi-turn reasoning tasks that are partially observable and adjustable in difficulty. Through an automated pipeline, KUMO continuously generates novel tasks across open-ended domains, compelling models to demonstrate genuine generalization rather than memorization. We evaluated 23 state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO, benchmarking their reasoning abilities against university students. Our findings reveal that many LLMs have outperformed university-level performance on easy reasoning tasks, and reasoning-scaled LLMs reach university-level performance on complex reasoning challenges. Moreover, LLM performance on KUMO tasks correlates strongly with results on newly released real-world reasoning benchmarks, underscoring KUMO’s value as a robust, enduring assessment tool for genuine LLM reasoning capabilities.

arXiv information

Authors	Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
Published	2025-04-25 12:02:19+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
