Language Models Fail to Introspect About Their Knowledge of Language

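Summary

There has been recent interest in whether large language models (LLMs) can introspect about their own internal states.
Such an ability would make LLMs more interpretable, and would also validate the use of standard introspective methods from linguistics to evaluate a model's grammatical knowledge (for example, asking "Is this sentence grammatical?").
We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction.
Crucially, in both domains, a model's internal linguistic knowledge can be theoretically grounded in direct measurements of string probability.
We then evaluate whether models' responses to metalinguistic prompts faithfully reflect their internal knowledge.
We propose a new measure of introspection: the degree to which a model's prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge.
Both metalinguistic prompting and probability comparisons lead to high task accuracy, but we find no evidence that LLMs have privileged "self-access".
Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models' linguistic generalizations.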

Summary (original)

There has been recent interest in whether large language models (LLMs) can introspect about their own internal states. Such abilities would make LLMs more interpretable, and also validate the use of standard introspective methods in linguistics to evaluate grammatical knowledge in models (e.g., asking ‘Is this sentence grammatical?’). We systematically investigate emergent introspection across 21 open-source LLMs, in two domains where introspection is of theoretical interest: grammatical knowledge and word prediction. Crucially, in both domains, a model’s internal linguistic knowledge can be theoretically grounded in direct measurements of string probability. We then evaluate whether models’ responses to metalinguistic prompts faithfully reflect their internal knowledge. We propose a new measure of introspection: the degree to which a model’s prompted responses predict its own string probabilities, beyond what would be predicted by another model with nearly identical internal knowledge. While both metalinguistic prompting and probability comparisons lead to high task accuracy, we do not find evidence that LLMs have privileged ‘self-access’. Our findings complicate recent results suggesting that models can introspect, and add new evidence to the argument that prompted responses should not be conflated with models’ linguistic generalizations.
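The proposed measure can be illustrated with a small sketch. It is not the authors' implementation: it assumes each model yields one "direct" score per test item (for example, a log-probability difference between the two sentences of a minimal pair) and one "prompted" score per item (here assumed to be log P("Yes") minus log P("No") for a grammaticality question). The function names `pearson` and `self_access_gap`, the choice of Pearson correlation, and all numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def self_access_gap(prompted_a, direct_a, direct_b):
    """How much better model A's prompted scores track A's own direct
    scores than they track the direct scores of a similar model B.

    A value near zero means A's prompted answers predict B's string
    probabilities about as well as A's own, i.e. no sign of privileged
    self-access under this (assumed) operationalization.
    """
    return pearson(prompted_a, direct_a) - pearson(prompted_a, direct_b)


# Toy, made-up scores for five minimal pairs (not data from the paper).
direct_a = [2.0, 1.0, -0.5, 3.0, 0.5]    # model A, direct measurement
direct_b = [1.8, 1.2, -0.4, 2.7, 0.6]    # similar model B, direct measurement
prompted_a = [1.5, 0.9, -0.2, 2.4, 0.3]  # model A, metalinguistic prompt

gap = self_access_gap(prompted_a, direct_a, direct_b)
print(f"self-access gap: {gap:+.3f}")
```

Because each correlation lies between -1 and 1, the gap lies between -2 and 2. The comparison is only informative when model B's internal knowledge is close to model A's, which is the condition stated in the abstract.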

arXiv information

Authors	Siyuan Song, Jennifer Hu, Kyle Mahowald
Published	2025-03-10 16:33:14+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
