Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models

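Summary

Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent the mental states of themselves and others.
Understanding these internal mechanisms matters, not only for moving beyond surface-level performance but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs.
In this work, we present the first systematic investigation of belief representations in LMs, probing models across different scales, training regimens, and prompts, and using control tasks to rule out confounds.
Our experiments provide evidence that both model size and fine-tuning substantially improve LMs' internal representations of others' beliefs. These representations are structured rather than mere by-products of spurious correlations, yet they are brittle to prompt variations.
Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.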

Abstract (original)

Despite growing interest in Theory of Mind (ToM) tasks for evaluating language models (LMs), little is known about how LMs internally represent mental states of self and others. Understanding these internal mechanisms is critical – not only to move beyond surface-level performance, but also for model alignment and safety, where subtle misattributions of mental states may go undetected in generated outputs. In this work, we present the first systematic investigation of belief representations in LMs by probing models across different scales, training regimens, and prompts – using control tasks to rule out confounds. Our experiments provide evidence that both model size and fine-tuning substantially improve LMs’ internal representations of others’ beliefs, which are structured – not mere by-products of spurious correlations – yet brittle to prompt variations. Crucially, we show that these representations can be strengthened: targeted edits to model activations can correct wrong ToM inferences.
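
As a rough illustration of the three ideas the abstract names (probing, control tasks, and activation edits), the following is a minimal sketch on synthetic data. It is not the authors' method or code: the "activations" are random vectors in which one coordinate is assumed to encode a binary belief label, the mean-difference probe stands in for whatever probe the paper trains, the control task is simplified to per-example random labels, and every function name is hypothetical.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(acts, labels):
    """Linear probe whose direction is the difference of the class means."""
    pos = mean_vector([a for a, y in zip(acts, labels) if y == 1])
    neg = mean_vector([a for a, y in zip(acts, labels) if y == 0])
    direction = [p - q for p, q in zip(pos, neg)]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_predict(direction, bias, act):
    """Return 1 if the activation falls on the positive side of the probe."""
    score = sum(d * x for d, x in zip(direction, act)) + bias
    return 1 if score > 0 else 0


def accuracy(direction, bias, acts, labels):
    hits = sum(probe_predict(direction, bias, a) == y for a, y in zip(acts, labels))
    return hits / len(acts)


def steer(act, direction, alpha):
    """Activation edit: shift the activation along the probe direction."""
    return [x + alpha * d for x, d in zip(act, direction)]


def make_synthetic_activations(n, dim, rng):
    """Assumption: the belief label shifts coordinate 0; the rest is noise."""
    acts, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        act = [rng.gauss(0, 1) for _ in range(dim)]
        act[0] += 2.0 if y == 1 else -2.0
        acts.append(act)
        labels.append(y)
    return acts, labels


rng = random.Random(0)
train_acts, train_labels = make_synthetic_activations(400, 8, rng)
test_acts, test_labels = make_synthetic_activations(200, 8, rng)

# 1) Probe: fit on the real labels, evaluate on held-out activations.
direction, bias = fit_mean_difference_probe(train_acts, train_labels)
probe_acc = accuracy(direction, bias, test_acts, test_labels)

# 2) Simplified control task: labels unrelated to the activations.
control_train = [rng.randint(0, 1) for _ in train_acts]
control_test = [rng.randint(0, 1) for _ in test_acts]
c_direction, c_bias = fit_mean_difference_probe(train_acts, control_train)
control_acc = accuracy(c_direction, c_bias, test_acts, control_test)

# Selectivity: how much better the probe does on the real task.
selectivity = probe_acc - control_acc

# 3) Activation edit: push label-0 activations across the probe boundary.
label0_acts = [a for a, y in zip(test_acts, test_labels) if y == 0]
steered = [steer(a, direction, 2.0) for a in label0_acts]
flipped_fraction = sum(probe_predict(direction, bias, a) for a in steered) / len(steered)

print(f"probe accuracy:   {probe_acc:.2f}")
print(f"control accuracy: {control_acc:.2f}")
print(f"selectivity:      {selectivity:.2f}")
print(f"flipped by edit:  {flipped_fraction:.2f}")
```

A large gap between probe accuracy and control accuracy suggests the probe reads structure that is present in the activations rather than memorising labels. In a real model, the edit would be applied to hidden states during the forward pass and judged by the model's downstream answers, which this toy example does not attempt.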

arXiv information

Authors	Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Published	2025-05-19 16:43:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
