Are self-explanations from Large Language Models faithful?

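Summary

Instruction-tuned Large Language Models (LLMs) perform well on many tasks and can even explain their own reasoning; these explanations are called self-explanations.
However, a self-explanation that is convincing but wrong can create unwarranted trust in LLMs and so increase risk.
It is therefore important to measure whether self-explanations truly reflect the model's behavior.
This property is called interpretability-faithfulness, and it is hard to measure because the ground truth is inaccessible and many LLMs are available only through an inference API.
To address this, we propose using self-consistency checks to measure faithfulness.
For example, if an LLM says that a set of words is important for its prediction, it should be unable to make that prediction without those words.
Self-consistency checks are a common approach to faithfulness, but they had not previously been applied successfully to LLM self-explanations in the form of counterfactuals, importance measures, and redaction explanations.
Our results show that faithfulness depends on the explanation, the model, and the task, which indicates that self-explanations should not be trusted in general.
For example, in sentiment classification, counterfactuals are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B.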

Abstract (original)

Instruction-tuned Large Language Models (LLMs) excel at many tasks and will even explain their reasoning, so-called self-explanations. However, convincing and wrong self-explanations can lead to unsupported confidence in LLMs, thus increasing risk. Therefore, it’s important to measure if self-explanations truly reflect the model’s behavior. Such a measure is called interpretability-faithfulness and is challenging to perform since the ground truth is inaccessible, and many LLMs only have an inference API. To address this, we propose employing self-consistency checks to measure faithfulness. For example, if an LLM says a set of words is important for making a prediction, then it should not be able to make its prediction without these words. While self-consistency checks are a common approach to faithfulness, they have not previously been successfully applied to LLM self-explanations for counterfactual, importance measure, and redaction explanations. Our results demonstrate that faithfulness is explanation, model, and task-dependent, showing self-explanations should not be trusted in general. For example, with sentiment classification, counterfactuals are more faithful for Llama2, importance measures for Mistral, and redaction for Falcon 40B.
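The self-consistency check for importance explanations described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`redact`, `importance_is_faithful`), the `[REDACTED]` mask token, the "unknown" label, and the keyword-based `toy_classify` that stands in for an LLM inference call are all hypothetical. The rule "the explanation is faithful if the prediction changes once the claimed words are removed" is one simple reading of the abstract's example.

```python
from typing import Callable, List

MASK = "[REDACTED]"   # hypothetical mask token
PUNCT = ".,!?;:"

def redact(text: str, words: List[str]) -> str:
    """Replace every token whose punctuation-stripped form is in `words` with MASK."""
    targets = {w.lower() for w in words}
    out = []
    for token in text.split():
        core = token.strip(PUNCT).lower()
        out.append(MASK if core in targets else token)
    return " ".join(out)

def importance_is_faithful(
    text: str,
    classify: Callable[[str], str],
    explain: Callable[[str], List[str]],
) -> bool:
    """Self-consistency check: if the words the model calls important are
    removed, the model should no longer reach its original prediction."""
    original = classify(text)
    important = explain(text)
    return classify(redact(text, important)) != original

# --- Toy stand-ins for an LLM (assumptions for illustration only) ---
POSITIVE = {"great", "wonderful"}
NEGATIVE = {"awful", "boring"}

def toy_classify(text: str) -> str:
    """Keyword-count sentiment classifier with an explicit 'unknown' label."""
    tokens = {t.strip(PUNCT).lower() for t in text.split()}
    pos, neg = len(tokens & POSITIVE), len(tokens & NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"

def honest_explain(text: str) -> List[str]:
    """Names the words the toy classifier actually relies on."""
    keywords = POSITIVE | NEGATIVE
    return [t.strip(PUNCT) for t in text.split() if t.strip(PUNCT).lower() in keywords]

def misleading_explain(text: str) -> List[str]:
    """Names a plausible-sounding word the toy classifier ignores."""
    return ["movie"]

if __name__ == "__main__":
    review = "The movie was great, truly wonderful."
    print(importance_is_faithful(review, toy_classify, honest_explain))
    print(importance_is_faithful(review, toy_classify, misleading_explain))
```

With a real model, `classify` and `explain` would both be prompts sent to the same inference API, which is why a check of this kind needs no access to model internals or ground-truth explanations.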

arXiv information

Authors	Andreas Madsen, Sarath Chandar, Siva Reddy
Published	2024-02-15 17:19:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
