JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models

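Summary

Logical reasoning is a critical component of large language models (LLMs), and substantial research in recent years has aimed to strengthen their deductive reasoning capabilities.
However, existing deductive reasoning benchmarks, which are essential for evaluating and advancing LLMs, are inadequate because of their lack of task complexity, the presence of prior knowledge as a confounder, and superficial error analysis.
To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs.
JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures;
(ii) prior knowledge independent, eliminating the advantage of models that possess prior knowledge and ensuring that only deductive reasoning is used to answer questions; and
(iii) capable of in-depth error analysis of the heterogeneous effects of reasoning depth and argument form on model accuracy.
Experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par with or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average.
All code and data are available at https://github.com/michaelchen-lab/JustLogic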

Abstract (original)

Logical reasoning is a critical component of Large Language Models (LLMs), and substantial research efforts in recent years have aimed to enhance their deductive reasoning capabilities. However, existing deductive reasoning benchmarks, which are crucial for evaluating and advancing LLMs, are inadequate due to their lack of task complexity, presence of prior knowledge as a confounder, and superficial error analysis. To address these deficiencies, we introduce JustLogic, a synthetically generated deductive reasoning benchmark designed for rigorous evaluation of LLMs. JustLogic is (i) highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures; (ii) prior knowledge independent, eliminating the advantage of models possessing prior knowledge and ensuring that only deductive reasoning is used to answer questions; and (iii) capable of in-depth error analysis on the heterogeneous effects of reasoning depth and argument form on model accuracy. Our experimental results on JustLogic reveal that (i) state-of-the-art (SOTA) reasoning LLMs perform on par or better than the human average but significantly worse than the human ceiling, and (ii) SOTA non-reasoning models still underperform the human average. All code and data are available at https://github.com/michaelchen-lab/JustLogic
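
The error analysis described in point (iii) amounts to breaking accuracy down by reasoning depth and by argument form. The sketch below is only an illustration of that kind of breakdown, not the authors' code: the record fields (`depth`, `form`, `correct`), the helper `accuracy_by`, and the sample records are all hypothetical, and the argument-form names are used purely as example labels.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group value: accuracy} for the given record field."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        hits[record[key]] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Made-up records for illustration only; these are not results from the paper.
records = [
    {"depth": 1, "form": "modus ponens", "correct": True},
    {"depth": 1, "form": "modus tollens", "correct": True},
    {"depth": 3, "form": "modus ponens", "correct": True},
    {"depth": 3, "form": "modus tollens", "correct": False},
]

by_depth = accuracy_by(records, "depth")
by_form = accuracy_by(records, "form")
```

Grouping the same records by two different keys is what makes heterogeneous effects visible: a model can look fine on average while failing on one depth or one argument form.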

arXiv information

Authors	Michael K. Chen, Xikun Zhang, Dacheng Tao
Published	2025-05-09 05:26:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
