Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Summary

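Sparse autoencoders (SAEs) are a promising approach for extracting neural network representations: they learn a sparse, overcomplete decomposition of the network's internal activations.
However, SAEs have traditionally been trained on activation values alone, without regard to the effect those activations have on downstream computation.
This limits the information available for learning features, and it biases the autoencoder toward neglecting features that are represented by small activation values yet strongly influence the model's output.
To address this, the authors introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function so that it relies on the gradients of the input activation when selecting the $k$ elements.
At a given sparsity level, g-SAEs produce reconstructions that, when propagated through the network, are more faithful to the original network's performance.
The authors also find evidence that g-SAEs learn latents that are, on average, more effective at steering models in arbitrary contexts.
By taking the downstream effects of activations into account, the approach leverages the dual nature of neural network features as both $\textit{representations}$ (retrospectively) and $\textit{actions}$ (prospectively).
Previous methods have approached feature discovery mainly from the former perspective; g-SAEs are a step toward accounting for the latter as well.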
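
The summary only states that the TopK selection relies on gradients of the input activation; it does not give the scoring rule. The sketch below is therefore an illustration under stated assumptions, not the authors' implementation. It assumes a hypothetical score of the form `|pre_i * <W_dec[i], grad>|`, a first-order estimate of how much each latent would change a downstream scalar such as the loss. The names `sae_forward`, `topk_mask`, and `use_gradient`, and the weight layout, are all invented for this example.

```python
import numpy as np

def topk_mask(scores, k):
    """Boolean mask marking the k largest scores in each row."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sae_forward(x, grad, W_enc, b_enc, W_dec, b_dec, k, use_gradient=True):
    """Forward pass of a k-sparse autoencoder (illustrative only).

    x:     (batch, d_model) activations taken from the network
    grad:  (batch, d_model) gradient of a downstream scalar w.r.t. x
    W_enc: (d_model, n_latents), W_dec: (n_latents, d_model)
    """
    pre = (x - b_dec) @ W_enc + b_enc          # encoder pre-activations
    if use_gradient:
        # ASSUMPTION: score each latent by a first-order estimate of its
        # downstream effect, pre_i * <decoder direction i, gradient>.
        influence = grad @ W_dec.T             # (batch, n_latents)
        scores = np.abs(pre * influence)
    else:
        # Plain TopK: select by pre-activation value alone.
        scores = pre
    mask = topk_mask(scores, k)
    latents = np.where(mask, pre, 0.0)         # keep only the k selected latents
    recon = latents @ W_dec + b_dec            # reconstruction of x
    return latents, recon
```

With identity encoder and decoder weights, `x = [[3.0, 0.5]]`, `grad = [[0.0, 10.0]]`, and `k = 1`, plain TopK keeps the first latent (the larger activation), while the gradient-aware variant keeps the second, whose activation is small but whose estimated downstream effect is larger. This is the bias the summary describes.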

Summary (original)

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network’s internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.

arXiv information

Authors	Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate
Published	2024-11-15 18:03:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
