Universal Neurons in GPT2 Language Models

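Summary

A basic question in the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms.
In other words, are neural mechanisms universal across different models?
Motivated by the hypothesis that universal neurons are likely to be interpretable, this work studies the universality of individual neurons across GPT2 models trained from different initial random seeds.
Specifically, the authors compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds, and find that 1-5% of neurons are universal, that is, they form pairs of neurons that consistently activate on the same inputs.
They then study these universal neurons in detail, finding that they usually have clear interpretations and can be grouped into a small number of neuron families.
Finally, they study patterns in the neuron weights to establish several universal functional roles of neurons in simple circuits: deactivating attention heads, changing the entropy of the next-token distribution, and predicting that the next token will (or will not) be within a particular set.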
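
The correlation step described above can be illustrated with a minimal sketch. It assumes that activations from two models have already been collected on the same tokens as arrays of shape `(n_tokens, n_neurons)`. The function names, the use of Pearson correlation, and the `threshold=0.5` cutoff are illustrative assumptions, not details stated in this summary.

```python
import numpy as np

def pairwise_correlation(acts_a, acts_b):
    """Pearson correlation between every neuron of model A and every neuron of model B.

    acts_a: (n_tokens, n_neurons_a), acts_b: (n_tokens, n_neurons_b),
    both recorded on the same token sequence.
    Returns an (n_neurons_a, n_neurons_b) correlation matrix.
    """
    # Center each neuron's activations over the token axis.
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    # Scale each column to unit norm; the small constant guards against dead neurons.
    a = a / (np.linalg.norm(a, axis=0) + 1e-12)
    b = b / (np.linalg.norm(b, axis=0) + 1e-12)
    return a.T @ b

def universal_candidates(acts_a, acts_b, threshold=0.5):
    """Indices of model-A neurons whose best match in model B exceeds `threshold`.

    The threshold value is a hypothetical choice for illustration.
    """
    corr = pairwise_correlation(acts_a, acts_b)
    best = corr.max(axis=1)  # best-matching neuron in B for each neuron in A
    return np.flatnonzero(best > threshold), best

# Synthetic stand-in for real activations: neuron 1 of A reappears (with noise) in B.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(1000, 4))
acts_b = rng.normal(size=(1000, 5))
acts_b[:, 2] = acts_a[:, 1] + 0.1 * rng.normal(size=1000)

idx, best = universal_candidates(acts_a, acts_b)
print(idx, best.round(2))
```

At the scale described in the summary (100 million tokens, five seeds), the same statistics would need to be accumulated in batches rather than held in memory at once; the dense version here is only meant to show the computation.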

Abstract (original)

A basic question within the emerging field of mechanistic interpretability is the degree to which neural networks learn the same underlying mechanisms. In other words, are neural mechanisms universal across different models? In this work, we study the universality of individual neurons across GPT2 models trained from different initial random seeds, motivated by the hypothesis that universal neurons are likely to be interpretable. In particular, we compute pairwise correlations of neuron activations over 100 million tokens for every neuron pair across five different seeds and find that 1-5% of neurons are universal, that is, pairs of neurons which consistently activate on the same inputs. We then study these universal neurons in detail, finding that they usually have clear interpretations and taxonomize them into a small number of neuron families. We conclude by studying patterns in neuron weights to establish several universal functional roles of neurons in simple circuits: deactivating attention heads, changing the entropy of the next token distribution, and predicting the next token to (not) be within a particular set.

arXiv information

Authors	Wes Gurnee, Theo Horsley, Zifan Carl Guo, Tara Rezaei Kheirkhah, Qinyi Sun, Will Hathaway, Neel Nanda, Dimitris Bertsimas
Published	2024-01-22 18:11:01+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
