Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations

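Abstract

Large language models (LLMs) are trained to imitate humans in order to explain human decisions.
But can LLMs explain themselves?
And can their explanations help humans build mental models of how LLMs process different inputs?
To answer these questions, we propose evaluating the $\textbf{counterfactual simulatability}$ of natural language explanations: whether an explanation enables humans to precisely infer the model's outputs on diverse counterfactuals of the explained input.
For example, if a model answers "yes" to the input question "Can eagles fly?" with the explanation "all birds can fly", then humans would infer from the explanation that it would also answer "yes" to the counterfactual input "Can penguins fly?".
If the explanation is precise, the model's answer should match the humans' expectations.
We implemented two metrics based on counterfactual simulatability: precision and generality.
We used LLMs to generate diverse counterfactuals automatically.
We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on two tasks: multi-hop factual reasoning and reward modeling.
We found that LLM explanations have low precision, and that precision does not correlate with plausibility.
Therefore, naively optimizing for human approval (e.g., RLHF) may not be a sufficient solution.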

Abstract (original)

Large language models (LLMs) are trained to imitate humans to explain human decisions. However, do LLMs explain themselves? Can they help humans build mental models of how LLMs process different inputs? To answer these questions, we propose to evaluate $\textbf{counterfactual simulatability}$ of natural language explanations: whether an explanation can enable humans to precisely infer the model’s outputs on diverse counterfactuals of the explained input. For example, if a model answers ‘yes’ to the input question ‘Can eagles fly?’ with the explanation ‘all birds can fly’, then humans would infer from the explanation that it would also answer ‘yes’ to the counterfactual input ‘Can penguins fly?’. If the explanation is precise, then the model’s answer should match humans’ expectations. We implemented two metrics based on counterfactual simulatability: precision and generality. We generated diverse counterfactuals automatically using LLMs. We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on two tasks: multi-hop factual reasoning and reward modeling. We found that LLM’s explanations have low precision and that precision does not correlate with plausibility. Therefore, naively optimizing human approvals (e.g., RLHF) may not be a sufficient solution.
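The precision idea in the abstract can be illustrated with a minimal sketch. It assumes precision is the fraction of counterfactuals, among those where a human can infer an answer from the explanation, on which the human's inferred answer matches the model's actual answer. The abstract does not spell out the exact formula, so this reading is an assumption. The `Counterfactual` class, the `simulation_precision` function, and the sample answers are hypothetical, not the authors' code or data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counterfactual:
    """One counterfactual input (hypothetical structure for illustration)."""
    question: str
    # What a human infers the model will answer, given only the explanation.
    # None means the explanation does not let the human infer an answer.
    human_guess: Optional[str]
    # What the model actually answers on this counterfactual.
    model_answer: str


def simulation_precision(counterfactuals: List[Counterfactual]) -> Optional[float]:
    """Share of inferable counterfactuals where the human guess matches the model.

    Returns None when no counterfactual is inferable from the explanation.
    """
    inferable = [c for c in counterfactuals if c.human_guess is not None]
    if not inferable:
        return None
    matches = sum(c.human_guess == c.model_answer for c in inferable)
    return matches / len(inferable)


# Made-up data for the explanation "all birds can fly".
examples = [
    Counterfactual("Can penguins fly?", "yes", "no"),   # human misled by the explanation
    Counterfactual("Can sparrows fly?", "yes", "yes"),  # human guess matches the model
    Counterfactual("Can bats fly?", None, "yes"),       # explanation says nothing about bats
]

print(simulation_precision(examples))  # 0.5
```

In this toy case the explanation leads the human to the wrong guess on penguins, so only one of the two inferable counterfactuals matches. Generality, the second metric named in the abstract, is not covered by this sketch.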

arXiv information

Authors	Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob Steinhardt, Zhou Yu, Kathleen McKeown
Published	2023-07-17 17:41:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
