Features that Make a Difference: Leveraging Gradients for Improved Dictionary Learning

Summary

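Sparse autoencoders (SAEs) are a promising approach for extracting the representations of a neural network by learning a sparse, overcomplete decomposition of its internal activations.
However, SAEs have traditionally been trained on activation values alone, without regard to the effect those activations have on downstream computation.
This restricts the information available for learning features and biases the autoencoder toward ignoring features that have small activation values but strongly influence the model's output.
To address this, the authors introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function so that it relies on the gradients of the input activation when selecting the $k$ elements.
At a given sparsity level, g-SAEs produce reconstructions that, when propagated through the network, are more faithful to the original network's performance.
The authors also find evidence that g-SAEs learn latents that are, on average, more effective at steering the model in arbitrary contexts.
By taking the downstream effects of activations into account, the approach exploits the dual nature of neural network features: as $\textit{representations}$, viewed retrospectively, and as $\textit{actions}$, viewed prospectively.
Earlier methods approached feature discovery with a focus mainly on the former aspect, whereas g-SAEs are a step toward accounting for the latter as well.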
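The summary does not spell out how the gradient enters the TopK selection, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each latent is scored by the magnitude of a first-order estimate of its downstream effect (its pre-activation times the dot product of its decoder direction with the gradient of a downstream loss with respect to the input activation). The function names, the scoring rule, and the weight layout (`W_enc`, `W_dec`, `b_enc`, `b_dec`) are all illustrative assumptions.

```python
import numpy as np

def topk_mask(scores, k):
    """Boolean mask marking the k largest entries in each row of `scores`."""
    idx = np.argpartition(scores, -k, axis=-1)[..., -k:]
    mask = np.zeros(scores.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask

def sae_forward(x, grad, W_enc, b_enc, W_dec, b_dec, k, use_gradient):
    """k-sparse autoencoder forward pass with two selection rules.

    x:    (batch, d) activations taken from the network
    grad: (batch, d) gradient of a downstream loss w.r.t. x
    W_enc: (d, n), W_dec: (n, d) with n latents
    """
    pre = (x - b_dec) @ W_enc + b_enc            # (batch, n) pre-activations
    if use_gradient:
        # ASSUMED scoring rule (not confirmed by the source):
        # first-order effect of latent i on the loss is pre_i * (W_dec[i] . grad)
        effect = grad @ W_dec.T                  # (batch, n)
        scores = np.abs(pre * effect)
    else:
        # Plain TopK: rank latents by activation value only
        scores = pre
    mask = topk_mask(scores, k)
    latents = np.where(mask, pre, 0.0)           # keep only the k selected latents
    recon = latents @ W_dec + b_dec              # (batch, d) reconstruction
    return latents, recon

# Toy example: latent 0 has a large activation, latent 1 a small one,
# but the downstream gradient points along latent 1's decoder direction.
x = np.array([[3.0, 0.5]])
grad = np.array([[0.0, 10.0]])
W_enc = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
W_dec = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.5, 0.5]])
b_enc = np.zeros(3)
b_dec = np.zeros(2)

plain_latents, plain_recon = sae_forward(x, grad, W_enc, b_enc, W_dec, b_dec, k=1, use_gradient=False)
grad_latents, grad_recon = sae_forward(x, grad, W_enc, b_enc, W_dec, b_dec, k=1, use_gradient=True)
```

In this toy setup the activation-only rule keeps the large latent, while the gradient-aware rule keeps the small latent that the downstream loss is sensitive to, which is the bias the summary describes.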

Abstract (original)

Sparse Autoencoders (SAEs) are a promising approach for extracting neural network representations by learning a sparse and overcomplete decomposition of the network’s internal activations. However, SAEs are traditionally trained considering only activation values and not the effect those activations have on downstream computations. This limits the information available to learn features, and biases the autoencoder towards neglecting features which are represented with small activation values but strongly influence model outputs. To address this, we introduce Gradient SAEs (g-SAEs), which modify the $k$-sparse autoencoder architecture by augmenting the TopK activation function to rely on the gradients of the input activation when selecting the $k$ elements. For a given sparsity level, g-SAEs produce reconstructions that are more faithful to original network performance when propagated through the network. Additionally, we find evidence that g-SAEs learn latents that are on average more effective at steering models in arbitrary contexts. By considering the downstream effects of activations, our approach leverages the dual nature of neural network features as both $\textit{representations}$, retrospectively, and $\textit{actions}$, prospectively. While previous methods have approached the problem of feature discovery primarily focused on the former aspect, g-SAEs represent a step towards accounting for the latter as well.

arXiv information

Authors	Jeffrey Olmo, Jared Wilson, Max Forsey, Bryce Hepner, Thomas Vin Howe, David Wingate
Published	2025-03-31 20:36:37+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
