Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

Summary

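Mechanistic interpretability aims to reverse engineer the algorithms that neural networks implement by studying their weights and activations.
One obstacle to reverse engineering a neural network is that many of its parameters take no part in the computation the network implements.
These degenerate parameters can obscure the network's internal structure.
Singular learning theory tells us that neural networks are biased toward more degenerate parameterizations, and that more degenerate parameterizations are likely to generalize further.
We identify three ways in which network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; and ReLUs that fire on the same subset of datapoints.
We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric, based on this argument, for identifying modules in a network.
We propose that if a neural network can be represented in a way that is invariant to reparameterizations exploiting these degeneracies, the representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions.
We introduce the Interaction Basis, a tractable technique for obtaining a representation that is invariant to degeneracies arising from linear dependence of activations or Jacobians.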

Abstract (original)

Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.
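
The first kind of degeneracy named in the abstract, linear dependence between activations in a layer, can be illustrated with a small NumPy sketch. This is a toy example under stated assumptions, not the paper's Interaction Basis method: the layer sizes, the variable names, and the choice of a purely linear next layer are all hypothetical. It shows that when one neuron's activation is a linear combination of others on every datapoint, the next layer's weights can move along a null direction without changing any output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 200 datapoints, a hidden layer of width 4, 3 outputs.
n_data, width, n_out = 200, 4, 3
acts = rng.normal(size=(n_data, width))
# Force a linear dependence: neuron 3 always equals neuron 0 + neuron 1.
acts[:, 3] = acts[:, 0] + acts[:, 1]

# Weights of the next layer (assumed linear here), mapping activations to outputs.
W = rng.normal(size=(width, n_out))

# Directions v with acts @ v == 0 on every datapoint span the null space of acts.
_, sing_vals, vt = np.linalg.svd(acts, full_matrices=False)
null_dirs = vt[sing_vals < 1e-10 * sing_vals.max()]  # one row per null direction

# Moving W along a null direction changes the parameters but not the outputs.
delta = np.outer(null_dirs[0], rng.normal(size=n_out))
W_shifted = W + delta

max_output_change = np.abs(acts @ W_shifted - acts @ W).max()
# Each null direction can be scaled independently for each output unit.
n_free_directions = null_dirs.shape[0] * n_out

print("null directions in activation space:", null_dirs.shape[0])
print("largest change in any output:", max_output_change)
print("degenerate weight directions in the next layer:", n_free_directions)
```

In this toy setting a single dependence among four neurons frees one weight direction per output unit, so the outputs on the dataset do not pin down those directions of `W`.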

arXiv information

Authors	Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
Published	2024-05-20 16:47:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

