The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

Summary

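Large language models (LLMs) have impressive capabilities, but they are also prone to outputting falsehoods.
Recent work has developed techniques for inferring whether an LLM is telling the truth by training probes on its internal activations.
However, this line of work is controversial: some authors point out that these probes fail to generalize in basic ways, among other conceptual issues.
This work curates high-quality datasets of true/false statements and uses them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence:
1. Visualizations of LLM true/false statement representations, which reveal clear linear structure.
2. Transfer experiments in which probes trained on one dataset generalize to different datasets.
3. Causal evidence obtained by surgically intervening in an LLM's forward pass, causing it to treat false statements as true and vice versa.
Overall, the work presents evidence that language models linearly represent the truth or falsehood of factual statements.
It also introduces a new technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.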
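
The abstract names mass-mean probing without defining it. As a rough illustration only, the sketch below assumes the probe direction is the difference between the mean activation of true statements and the mean activation of false statements. The function names, the midpoint-centered score, and the two-dimensional toy activations are all hypothetical choices made for this example and do not come from the abstract.

```python
# Minimal sketch of a mass-mean style probe (assumed definition:
# direction = mean(true activations) - mean(false activations)).
# All names and data here are hypothetical and purely illustrative.

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mass_mean_direction(true_acts, false_acts):
    """Difference between the class means of the activations."""
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    return [t - f for t, f in zip(mu_true, mu_false)]

def probe_score(direction, activation, midpoint):
    """Projection of (activation - midpoint) onto the direction.
    Centering at the midpoint is an assumption of this sketch."""
    return sum(d * (a - m) for d, a, m in zip(direction, activation, midpoint))

# Toy "activations": true statements sit at positive x, false at negative x.
true_acts = [[2.0, 1.0], [4.0, 3.0]]
false_acts = [[-2.0, 1.0], [-4.0, 3.0]]

direction = mass_mean_direction(true_acts, false_acts)
midpoint = mean_vector(true_acts + false_acts)

# A positive score is read as "true", a negative score as "false".
score_a = probe_score(direction, [1.0, 5.0], midpoint)
score_b = probe_score(direction, [-1.0, 0.0], midpoint)
```

In this toy setup the second coordinate carries no truth signal, so the direction ignores it and the sign of the score depends only on the first coordinate.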

Summary (original)

Large Language Models (LLMs) have impressive capabilities, but are also prone to outputting falsehoods. Recent work has developed techniques for inferring whether a LLM is telling the truth by training probes on the LLM’s internal activations. However, this line of work is controversial, with some authors pointing out failures of these probes to generalize in basic ways, among other conceptual issues. In this work, we curate high-quality datasets of true/false statements and use them to study in detail the structure of LLM representations of truth, drawing on three lines of evidence: 1. Visualizations of LLM true/false statement representations, which reveal clear linear structure. 2. Transfer experiments in which probes trained on one dataset generalize to different datasets. 3. Causal evidence obtained by surgically intervening in a LLM’s forward pass, causing it to treat false statements as true and vice versa. Overall, we present evidence that language models linearly represent the truth or falsehood of factual statements. We also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques.

arXiv information

Authors	Samuel Marks, Max Tegmark
Published	2023-10-10 17:54:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
