Improving Neuron-level Interpretability with White-box Language Models

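Summary

Neurons in auto-regressive language models such as GPT-2 can be interpreted by analyzing their activation patterns.
Recent work shows that techniques such as dictionary learning, a form of post-hoc sparse coding, improve this neuron-level interpretability.
Our goal is to improve neural network interpretability at a fundamental level by embedding sparse coding directly in the model architecture, rather than applying it after the fact.
To that end, we introduce the Coding RAte TransformEr (CRATE), a white-box, transformer-like architecture explicitly designed to capture sparse, low-dimensional structure in the data distribution.
Our comprehensive experiments show substantial gains in neuron-level interpretability across a range of evaluation metrics, with a relative improvement of up to 103%.
Detailed investigation confirms that the improved interpretability holds steadily across layers regardless of model size, which underlines CRATE's robustness in enhancing neural network interpretability.
Further analysis shows that CRATE's gains come from a stronger ability to activate consistently and distinctively on relevant tokens.
These findings point to a promising direction for building white-box foundation models that excel at neuron-level interpretation.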

Summary (original)

Neurons in auto-regressive language models like GPT-2 can be interpreted by analyzing their activation patterns. Recent studies have shown that techniques such as dictionary learning, a form of post-hoc sparse coding, enhance this neuron-level interpretability. In our research, we are driven by the goal to fundamentally improve neural network interpretability by embedding sparse coding directly within the model architecture, rather than applying it as an afterthought. In our study, we introduce a white-box transformer-like architecture named Coding RAte TransformEr (CRATE), explicitly engineered to capture sparse, low-dimensional structures within data distributions. Our comprehensive experiments showcase significant improvements (up to 103% relative improvement) in neuron-level interpretability across a variety of evaluation metrics. Detailed investigations confirm that this enhanced interpretability is steady across different layers irrespective of the model size, underlining CRATE’s robust performance in enhancing neural network interpretability. Further analysis shows that CRATE’s increased interpretability comes from its enhanced ability to consistently and distinctively activate on relevant tokens. These findings point towards a promising direction for creating white-box foundation models that excel in neuron-level interpretation.
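The abstract does not describe how CRATE performs sparse coding internally, so the following is only a generic illustration of what "sparse coding" means, not the paper's method. It is a minimal sketch of ISTA (iterative shrinkage-thresholding), a standard algorithm for finding a sparse code `z` such that `D z` approximates an input `x`. The dictionary `D`, the input `x`, and the parameters `lam`, `step`, and `iters` are all hypothetical toy values chosen for this example.

```python
def soft_threshold(v, t):
    """Shrink each entry toward zero by t; entries within [-t, t] become 0."""
    return [x - t if x > t else x + t if x < -t else 0.0 for x in v]


def ista(D, x, lam=0.1, step=0.5, iters=200):
    """Approximately minimise 0.5 * ||x - D z||^2 + lam * ||z||_1 over z.

    D is a list of rows (len(x) rows, k columns); each column is one atom.
    step should not exceed 1 / (largest eigenvalue of D^T D).
    """
    n, k = len(D), len(D[0])
    z = [0.0] * k
    for _ in range(iters):
        # Residual of the current reconstruction: D z - x
        residual = [sum(D[i][j] * z[j] for j in range(k)) - x[i] for i in range(n)]
        # Gradient of the quadratic term: D^T (D z - x)
        grad = [sum(D[i][j] * residual[i] for i in range(n)) for j in range(k)]
        # Gradient step followed by shrinkage, which drives small entries to zero
        z = soft_threshold([z[j] - step * grad[j] for j in range(k)], step * lam)
    return z


# Toy (hypothetical) dictionary: three unit-norm atoms in 2-D.
D = [[1.0, 0.0, 0.6],
     [0.0, 1.0, 0.8]]
x = [1.0, 0.0]  # the input coincides with atom 0

code = ista(D, x)
print(code)
```

In this toy setting the resulting code puts its weight on the first atom and leaves the other two at zero, which is the kind of selective activation that makes individual units easier to interpret. How CRATE achieves a comparable effect inside the architecture is described in the paper itself, not here.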

arXiv information

Authors	Hao Bai, Yi Ma
Published	2025-02-27 15:22:03+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
