Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Summary

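Automated interpretability pipelines produce natural language descriptions of the concepts that features in large language models (LLMs) represent, such as plants or the first word of a sentence.
These descriptions are derived from inputs that activate the feature, where a feature may be a dimension or a direction in the model's representation space.
However, finding activating inputs is expensive, and a feature's mechanistic role in model behavior depends both on how inputs cause it to activate and on how its activation affects the outputs.
Steering evaluations show that the descriptions produced by current pipelines fail to capture a feature's causal effect on outputs.
To address this, the paper proposes efficient, output-centric methods for generating feature descriptions automatically.
These methods use either the tokens that gain weight after the feature is stimulated, or the highest-weighted tokens obtained by applying the vocabulary "unembedding" head directly to the feature.
Output-centric descriptions capture a feature's causal effect on model outputs better than input-centric ones, but combining the two gives the best performance on both input and output evaluations.
Finally, the paper shows that output-centric descriptions can be used to find inputs that activate features previously thought to be "dead".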

Summary (original)

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary ‘unembedding’ head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be ‘dead’.
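
The two output-centric ideas in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary, the unembedding matrix `W_U`, the steering strength `alpha`, and the function names `vocab_projection` and `token_change` are all assumptions made up for this example, and a real setup would use an actual model's hidden states and unembedding weights.

```python
import math

# Hypothetical toy vocabulary and unembedding matrix (one row per token,
# hidden size 3). All numbers are invented for illustration only.
VOCAB = ["tree", "leaf", "car", "the"]
W_U = [
    [2.0, 0.1, 0.0],   # tree
    [1.5, 0.2, 0.1],   # leaf
    [-1.0, 1.0, 0.3],  # car
    [0.0, 0.0, 1.0],   # the
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def logits(hidden):
    """Apply the unembedding matrix to a hidden vector."""
    return [dot(row, hidden) for row in W_U]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(scores, k):
    """Return the k vocabulary tokens with the highest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [VOCAB[i] for i in order[:k]]


def vocab_projection(feature, k=2):
    """Tokens weighted highest when the unembedding is applied to the feature itself."""
    return top_k(logits(feature), k)


def token_change(hidden, feature, alpha=4.0, k=2):
    """Tokens whose probability rises most after adding alpha * feature to hidden."""
    base = softmax(logits(hidden))
    steered_hidden = [h + alpha * f for h, f in zip(hidden, feature)]
    steered = softmax(logits(steered_hidden))
    gains = [s - b for s, b in zip(steered, base)]
    return top_k(gains, k)


feature = [1.0, 0.0, 0.0]  # assumed feature direction
hidden = [0.0, 0.0, 2.0]   # assumed hidden state that initially favors "the"

print(vocab_projection(feature))      # ['tree', 'leaf']
print(token_change(hidden, feature))  # ['tree', 'leaf']
```

In this toy setup both functions surface the plant-related tokens, which is the kind of token list a description generator could then summarize in natural language. Neither function needs a search over activating inputs, which is the efficiency argument the abstract makes.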

arXiv information

Authors	Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva
Published	2025-01-14 18:53:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
