Decomposing MLP Activations into Interpretable Features via Semi-Nonnegative Matrix Factorization

Abstract

A central goal for mechanistic interpretability has been to identify the right units of analysis in large language models (LLMs) that causally explain their outputs. While early work focused on individual neurons, evidence that neurons often encode multiple concepts has motivated a shift toward analyzing directions in activation space. A key question is how to find directions that capture interpretable features in an unsupervised manner. Current methods rely on dictionary learning with sparse autoencoders (SAEs), commonly trained over residual stream activations to learn directions from scratch. However, SAEs often struggle in causal evaluations and lack intrinsic interpretability, as their learning is not explicitly tied to the computations of the model. Here, we tackle these limitations by directly decomposing MLP activations with semi-nonnegative matrix factorization (SNMF), such that the learned features are (a) sparse linear combinations of co-activated neurons, and (b) mapped to their activating inputs, making them directly interpretable. Experiments on Llama 3.1, Gemma 2 and GPT-2 show that SNMF derived features outperform SAEs and a strong supervised baseline (difference-in-means) on causal steering, while aligning with human-interpretable concepts. Further analysis reveals that specific neuron combinations are reused across semantically-related features, exposing a hierarchical structure in the MLP’s activation space. Together, these results position SNMF as a simple and effective tool for identifying interpretable features and dissecting concept representations in LLMs.
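
To make the decomposition concrete, here is a minimal sketch of semi-nonnegative matrix factorization in NumPy. It is not the authors' implementation. It uses the standard multiplicative-update scheme for semi-NMF and assumes the activations are stacked into a matrix `X` with one row per MLP neuron and one column per input token. The names `semi_nmf`, `F`, and `G` are illustrative. `F` holds signed neuron directions (the features) and `G` holds nonnegative per-token coefficients. The sparsity constraint on neuron combinations mentioned in the abstract is omitted here.

```python
import numpy as np

def _pos(M):
    # Elementwise positive part of M.
    return (np.abs(M) + M) / 2

def _neg(M):
    # Elementwise negative part of M (returned as nonnegative values).
    return (np.abs(M) - M) / 2

def semi_nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor X (d x n) as F @ G.T with G >= 0 and F unconstrained in sign."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    G = rng.random((n, k)) + 0.1  # strictly positive initialization
    for _ in range(n_iter):
        # Least-squares update of the signed factor given G.
        F = X @ G @ np.linalg.pinv(G.T @ G)
        XtF = X.T @ F  # shape (n, k)
        FtF = F.T @ F  # shape (k, k)
        # Multiplicative update keeps every entry of G nonnegative.
        num = _pos(XtF) + G @ _neg(FtF)
        den = _neg(XtF) + G @ _pos(FtF)
        G *= np.sqrt(num / (den + eps))
    # Final least-squares refit of F for the last G.
    F = X @ G @ np.linalg.pinv(G.T @ G)
    return F, G

# Synthetic stand-in for MLP activations: 16 "neurons", 200 "tokens".
rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4)) @ rng.random((200, 4)).T
F, G = semi_nmf(X, k=4)

# Tokens that most strongly activate feature 0 (indices into the columns of X).
top_inputs = np.argsort(-G[:, 0])[:5]
```

Because `G` is nonnegative, each column of `G` can be read as "how strongly each input activates this feature", which is what allows a feature to be mapped back to its top-activating inputs as in the last line.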

arXiv information

Authors	Or Shafran, Atticus Geiger, Mor Geva
Published	2025-06-12 17:33:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
