BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

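Summary

Although many benchmarks compare the performance of modern language models (LMs), end-task evaluations often conflate the notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting the implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and to use a mixture of true and false facts, in particular counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, gives factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.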

Abstract (original)

While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* (‘truth’) and *reasoning ability* (‘rationality’, or ‘honesty’ in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the ‘content effect’). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
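The separation of the two scores can be illustrated with a small sketch. It is not the authors' evaluation code: the record types `Statement` and `Entailment`, the `accuracy` helper, and the toy data and model predictions below are all hypothetical, chosen only to show that truth judgments and entailment-validity judgments are scored independently. Note the first entailment: its reasoning is valid even though one premise is false, which is the kind of counterfactual case the abstract describes.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Statement:
    """Hypothetical record: a single statement with a gold truth label."""
    text: str
    is_true: bool


@dataclass(frozen=True)
class Entailment:
    """Hypothetical record: premises, a hypothesis, and a gold validity label."""
    premises: Tuple[str, ...]
    hypothesis: str
    is_valid: bool  # do the premises entail the hypothesis, regardless of truth?


def accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Percentage of predictions that match the gold labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty label list")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# Toy data (invented for illustration, not taken from BaRDa).
statements = [
    Statement("All birds can swim.", False),       # counterfactual premise
    Statement("A sparrow is a bird.", True),
    Statement("A sparrow can swim.", False),
    Statement("Water is a liquid at room temperature.", True),
]
entailments = [
    # Valid reasoning from a false premise.
    Entailment(("All birds can swim.", "A sparrow is a bird."),
               "A sparrow can swim.", True),
    # Invalid reasoning from a true premise.
    Entailment(("A sparrow is a bird.",),
               "Water is a liquid at room temperature.", False),
]

# Hypothetical model outputs: one truth judgment per statement,
# one validity judgment per entailment.
predicted_truth = [False, True, True, True]
predicted_validity = [False, False]

factual_accuracy = accuracy([s.is_true for s in statements], predicted_truth)
reasoning_accuracy = accuracy([e.is_valid for e in entailments], predicted_validity)

print(f"factual accuracy:   {factual_accuracy:.1f}")
print(f"reasoning accuracy: {reasoning_accuracy:.1f}")
```

In this toy run the hypothetical model rejects the valid entailment, as a model influenced by belief bias might when the conclusion is false, so its reasoning score drops independently of its factual score.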

arXiv information

Authors	Peter Clark, Bhavana Dalvi Mishra, Oyvind Tafjord
Published	2023-12-12 18:55:43+00:00

Source, services used

arxiv.jp, Google
