Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Abstract

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary ‘unembedding’ head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be ‘dead’.
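
As a rough illustration of the second output-centric idea mentioned in the abstract (applying the vocabulary unembedding directly to a feature and reading off the highest-weight tokens), here is a minimal sketch. It assumes the feature is a direction given as a plain vector and the unembedding is a matrix with one row per vocabulary token. The function name, the toy vocabulary, and all numbers are hypothetical and made up for illustration; this is not the authors' implementation, and it leaves out whatever step turns the token list into a natural language description.

```python
def top_tokens_by_unembedding(feature, unembed, vocab, k=3):
    """Project a feature direction through an unembedding matrix and
    return the k tokens with the highest resulting scores."""
    # One score per vocabulary token: dot product of its unembedding row
    # with the feature direction.
    logits = [sum(w * x for w, x in zip(row, feature)) for row in unembed]
    # Token indices sorted from highest to lowest score.
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:k]]


# Toy, made-up data: 5 tokens and a 4-dimensional representation space.
vocab = ["tree", "leaf", "car", "the", "root"]
unembed = [
    [2.0, 0.0, 0.0, 0.0],   # tree
    [1.5, 0.5, 0.0, 0.0],   # leaf
    [-1.0, 2.0, 0.0, 0.0],  # car
    [0.0, 0.0, 1.0, 0.0],   # the
    [1.0, 0.0, 0.0, 0.5],   # root
]
feature = [1.0, 0.0, 0.0, 0.0]  # hypothetical "plants" direction

print(top_tokens_by_unembedding(feature, unembed, vocab))
# prints ['tree', 'leaf', 'root']
```

No input text is needed for this projection, which is one reason such an approach can be cheap compared with searching for activating inputs.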

arXiv information

Authors	Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
Published	2025-05-29 15:26:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
