Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation

Summary

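First-order logic (FOL) reasoning, which involves sequential deduction, is crucial for intelligent systems and serves as a valuable task for evaluating reasoning ability, particularly in chain-of-thought (CoT) settings.
Existing benchmarks often depend on extensive human annotation or handcrafted templates, which makes it hard to reach the complexity, scalability, and diversity that robust evaluation requires.
To address these limitations, we propose a new framework called ProverGen, which combines the generative strengths of large language models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA.
ProverQA is also distinguished by including accessible and logically coherent intermediate reasoning steps for each problem.
Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems even with CoT prompting, which highlights how challenging the dataset is.
We also finetune Llama3.1-8B-Instruct on a separate training set generated by the framework.
The finetuned model shows consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of the proposed data generation framework.
Code is available at https://github.com/opendatalab/ProverGen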

Summary (original)

First-order logic (FOL) reasoning, which involves sequential deduction, is pivotal for intelligent systems and serves as a valuable task for evaluating reasoning capabilities, particularly in chain-of-thought (CoT) contexts. Existing benchmarks often rely on extensive human annotation or handcrafted templates, making it difficult to achieve the necessary complexity, scalability, and diversity for robust evaluation. To address these limitations, we propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models (LLMs) with the rigor and precision of symbolic provers, enabling the creation of a scalable, diverse, and high-quality FOL reasoning dataset, ProverQA. ProverQA is also distinguished by its inclusion of accessible and logically coherent intermediate reasoning steps for each problem. Our evaluation shows that state-of-the-art LLMs struggle to solve ProverQA problems, even with CoT prompting, highlighting the dataset’s challenging nature. We also finetune Llama3.1-8B-Instruct on a separate training set generated by our framework. The finetuned model demonstrates consistent improvements on both in-distribution and out-of-distribution test sets, suggesting the value of our proposed data generation framework. Code available at: https://github.com/opendatalab/ProverGen

arXiv information

Authors	Chengwen Qi, Ren Ma, Bowen Li, He Du, Binyuan Hui, Jinwang Wu, Yuanjun Laili, Conghui He
Published	2025-02-10 15:31:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
