SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas

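Summary

This paper introduces SATBench, a benchmark that evaluates the logical reasoning capabilities of large language models (LLMs) using logic puzzles derived from Boolean satisfiability (SAT) problems.
Prior work has focused on inference-rule-based reasoning, which typically means deducing conclusions from a set of premises. SATBench instead exploits the search-based nature of SAT problems, where the goal is to find a solution that satisfies a given set of logical constraints.
Each SATBench instance is generated from a SAT formula and then translated by LLMs into a story context and a set of conditions.
The generation process is fully automated, and difficulty can be adjusted by varying the number of clauses.
All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset.
Experiments show that even the strongest model, o4-mini, reaches only 65.0% accuracy on hard UNSAT problems, close to the 50% random baseline.
SATBench exposes fundamental limitations in the search-based logical reasoning of current LLMs and provides a scalable testbed for future research on logical reasoning.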
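
To make the "adjustable difficulty by varying the number of clauses" idea concrete, here is a minimal sketch of generating a random CNF formula and labeling it SAT or UNSAT. It is not the authors' pipeline: the function names `random_cnf` and `find_model` are hypothetical, the brute-force search stands in for a real SAT solver, and the LLM step that turns a formula into a story is not shown.

```python
import itertools
import random


def random_cnf(num_vars, num_clauses, clause_len=3, seed=0):
    """Return a random CNF formula as a list of clauses.

    Literals are non-zero ints: k means variable k is true,
    -k means variable k is false. (Hypothetical helper, not from the paper.)
    """
    rng = random.Random(seed)
    formula = []
    for _ in range(num_clauses):
        # Pick distinct variables for the clause, then negate each with p = 0.5.
        variables = rng.sample(range(1, num_vars + 1), clause_len)
        formula.append(tuple(v if rng.random() < 0.5 else -v for v in variables))
    return formula


def find_model(formula, num_vars):
    """Brute-force search over all assignments.

    Returns a satisfying assignment as a tuple of bools, or None if the
    formula is unsatisfiable. Only practical for small num_vars.
    """
    for bits in itertools.product([False, True], repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in formula):
            return bits
    return None


if __name__ == "__main__":
    # More clauses over the same variables means a more constrained puzzle.
    for num_clauses in (5, 20, 40):
        formula = random_cnf(num_vars=6, num_clauses=num_clauses)
        label = "SAT" if find_model(formula, 6) is not None else "UNSAT"
        print(num_clauses, label)
```

Because every instance carries a binary SAT/UNSAT label, a model that guesses at random is right about half the time, which is the 50% baseline the summary compares against.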

Abstract (original)

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference rule-based reasoning, which often involves deducing conclusions from a set of premises, our approach leverages the search-based nature of SAT problems, where the objective is to find a solution that fulfills a specified set of logical constraints. Each instance in SATBench is generated from a SAT formula, then translated into a story context and conditions using LLMs. The generation process is fully automated and allows for adjustable difficulty by varying the number of clauses. All 2100 puzzles are validated through both LLM-assisted and solver-based consistency checks, with human validation on a subset. Experimental results show that even the strongest model, o4-mini, achieves only 65.0% accuracy on hard UNSAT problems, close to the random baseline of 50%. SATBench exposes fundamental limitations in the search-based logical reasoning abilities of current LLMs and provides a scalable testbed for future research in logical reasoning.

arXiv information

Authors	Anjiang Wei,Yuheng Wu,Yingjia Wan,Tarun Suresh,Huanmi Tan,Zhanke Zhou,Sanmi Koyejo,Ke Wang,Alex Aiken
Published	2025-05-20 17:00:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
