InversionView: A General-Purpose Method for Reading Information from Neural Activations

Abstract

The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. Computing such subsets is nontrivial as the input space is exponentially large. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we demonstrate the characteristics of our method, show the distinctive advantages it offers, and provide causally verified circuits.
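The central idea, that an activation's information content is embodied by the set of inputs producing similar activations, can be illustrated with a toy brute-force sketch. Everything below is an assumption made for illustration: `toy_activation` is a hypothetical stand-in for a real model's activation, plain Euclidean distance with a threshold `epsilon` is one possible notion of "similar", and the input space is kept tiny so it can be enumerated. The abstract's point is that enumeration is infeasible for real models, which is why InversionView samples from a trained decoder instead.

```python
from itertools import product
import math

VOCAB = range(4)   # toy vocabulary: tokens 0..3 (assumption)
SEQ_LEN = 3        # toy sequence length (assumption)


def toy_activation(tokens):
    """Hypothetical stand-in for a model activation.

    By construction it encodes only the sum and the maximum of the tokens,
    so token order is deliberately not recoverable from it.
    """
    return (float(sum(tokens)), float(max(tokens)))


def preimage(query_tokens, epsilon):
    """Return every input whose activation lies within epsilon of the query's."""
    query_act = toy_activation(query_tokens)
    return [
        tokens
        for tokens in product(VOCAB, repeat=SEQ_LEN)
        if math.dist(toy_activation(tokens), query_act) <= epsilon
    ]


similar_inputs = preimage((1, 2, 3), epsilon=0.1)
for tokens in similar_inputs:
    print(tokens)
```

Reading the printed set shows what the toy activation encodes: every listed input shares the query's sum and maximum, while token order varies freely. Inspecting the subset in this way is the kind of analysis the abstract describes, here done by enumeration rather than by sampling from a decoder.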

arXiv information

Authors	Xinting Huang, Madhur Panwar, Navin Goyal, Michael Hahn
Published	2024-07-15 13:30:52+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
