Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models

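Summary

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has attracted significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
However, the crucial skill of logical reasoning remains underexplored.
Existing work on this reasoning ability of LLMs has focused on only a couple of inference rules of propositional and first-order logic (such as modus ponens and modus tollens).
To address this limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning propositional, first-order, and non-monotonic logics.
To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset that focuses on the use of a single inference rule.
We conduct a detailed analysis of a range of LLMs, including GPT-4, ChatGPT, Gemini, Llama-2, and Mistral, using chain-of-thought prompting.
Experimental results show that existing LLMs do not fare well on LogicBench.
In particular, they struggle with instances involving complex reasoning and negations.
Furthermore, they sometimes overlook contextual information needed to reason to the correct conclusion.
We believe that our work and findings will facilitate future research on evaluating and enhancing the logical reasoning ability of LLMs.
Data and code are available at https://github.com/Mihir3009/LogicBench.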

Abstract (original)

Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really ‘reason’ over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied. However, the crucial skill pertaining to ‘logical reasoning’ has remained underexplored. Existing work investigating this reasoning ability of LLMs has focused only on a couple of inference rules (such as modus ponens and modus tollens) of propositional and first-order logic. Addressing the above limitation, we comprehensively evaluate the logical reasoning ability of LLMs on 25 different reasoning patterns spanning over propositional, first-order, and non-monotonic logics. To enable systematic evaluation, we introduce LogicBench, a natural language question-answering dataset focusing on the use of a single inference rule. We conduct detailed analysis with a range of LLMs such as GPT-4, ChatGPT, Gemini, Llama-2, and Mistral using chain-of-thought prompting. Experimental results show that existing LLMs do not fare well on LogicBench; especially, they struggle with instances involving complex reasoning and negations. Furthermore, they sometimes overlook contextual information necessary for reasoning to arrive at the correct conclusion. We believe that our work and findings facilitate future research for evaluating and enhancing the logical reasoning ability of LLMs. Data and code are available at https://github.com/Mihir3009/LogicBench.
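For readers unfamiliar with the two inference rules named in the abstract, the following is a minimal illustrative sketch of what "a single inference rule" means in propositional logic. It is not taken from the paper or from the LogicBench repository; the helper names `implies` and `is_valid` are hypothetical, and the check is a plain truth-table enumeration rather than anything the authors describe.

```python
from itertools import product


def implies(p, q):
    """Material implication: 'p -> q' is false only when p is true and q is false."""
    return (not p) or q


def is_valid(premises, conclusion, n_vars=2):
    """Hypothetical helper: an argument form is valid if the conclusion holds
    in every truth assignment that satisfies all of the premises."""
    for values in product([False, True], repeat=n_vars):
        if all(prem(*values) for prem in premises) and not conclusion(*values):
            return False  # found a counterexample assignment
    return True


# Modus ponens: from (p -> q) and p, infer q.
modus_ponens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: p],
    lambda p, q: q,
)

# Modus tollens: from (p -> q) and (not q), infer (not p).
modus_tollens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: not q],
    lambda p, q: not p,
)

# Affirming the consequent (a fallacy): from (p -> q) and q, infer p.
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

print(modus_ponens, modus_tollens, affirming_consequent)
```

The first two forms pass the check, while the third fails because the assignment p = False, q = True satisfies both premises but not the conclusion. A benchmark item built around one rule would presumably wrap such a pattern in a natural language context and question, though the exact format used by LogicBench is not stated here.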

arXiv information

Authors	Mihir Parmar, Nisarg Patel, Neeraj Varshney, Mutsumi Nakamura, Man Luo, Santosh Mashetty, Arindam Mitra, Chitta Baral
Published	2024-04-23 21:08:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
