MastermindEval: A Simple But Scalable Reasoning Benchmark

Summary

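Recent advances in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks.
As a result, increasing attention is being paid to assessing the true reasoning capabilities of LLMs.
However, with the rapid progress of reasoning-focused models such as OpenAI's o1 and DeepSeek's R1, demand is growing for reasoning benchmarks that can keep pace with ongoing model development.
This paper introduces MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind.
The benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer.
The experimental results (1) show that even easy Mastermind instances are difficult for current models and (2) demonstrate that the benchmark can scale to more advanced future models. The authors also investigate why models fail to deduce the final solution and find that current models become less able to deduce the concealed code as the number of statements whose information must be combined increases.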

Abstract (original)

Recent advancements in large language models (LLMs) have led to remarkable performance across a wide range of language understanding and mathematical tasks. As a result, increasing attention has been given to assessing the true reasoning capabilities of LLMs, driving research into commonsense, numerical, logical, and qualitative reasoning. However, with the rapid progress of reasoning-focused models such as OpenAI’s o1 and DeepSeek’s R1, there has been a growing demand for reasoning benchmarks that can keep pace with ongoing model developments. In this paper, we introduce MastermindEval, a simple, scalable, and interpretable deductive reasoning benchmark inspired by the board game Mastermind. Our benchmark supports two evaluation paradigms: (1) agentic evaluation, in which the model autonomously plays the game, and (2) deductive reasoning evaluation, in which the model is given a pre-played game state with only one possible valid code to infer. In our experimental results we (1) find that even easy Mastermind instances are difficult for current models and (2) demonstrate that the benchmark is scalable to possibly more advanced models in the future. Furthermore, we investigate possible reasons why models cannot deduce the final solution and find that current models are limited in deducing the concealed code as the number of statements to combine information from increases.
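
To make the deductive setting concrete, the following is a minimal sketch of standard Mastermind feedback scoring and of checking that a pre-played game state leaves only one valid code. It is not the authors' implementation: the function names `feedback` and `consistent_codes`, the three-color alphabet, and the toy game history are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product


def feedback(secret, guess):
    """Return (exact, partial) under standard Mastermind rules.

    exact   = pegs with the right color in the right position
    partial = pegs with the right color in the wrong position
    """
    exact = sum(s == g for s, g in zip(secret, guess))
    # Color overlap ignoring position, computed as a multiset intersection.
    overlap = sum((Counter(secret) & Counter(guess)).values())
    return exact, overlap - exact


def consistent_codes(history, colors, length):
    """List every code that would reproduce all (guess, feedback) pairs."""
    return [
        code
        for code in product(colors, repeat=length)
        if all(feedback(code, guess) == fb for guess, fb in history)
    ]


# Hypothetical toy instance (assumption): 3 colors, code length 2.
colors = "RGB"
secret = ("R", "G")
guesses = [("R", "R"), ("G", "B"), ("R", "B")]
history = [(g, feedback(secret, g)) for g in guesses]

# A deductive-evaluation instance requires exactly one remaining code.
remaining = consistent_codes(history, colors, 2)
print(remaining)
```

Brute-force enumeration is only practical for small code lengths and color counts, but it is enough to verify that a given history pins down a single code, which is the property the deductive paradigm relies on.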

arXiv information

Authors	Jonas Golde, Patrick Haller, Fabio Barth, Alan Akbik
Published	2025-03-12 15:02:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
