Analyzing (In)Abilities of SAEs via Formal Languages

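Summary

Autoencoders have been used to find interpretable, disentangled features underlying neural network representations in both the image and text domains.
The efficacy and pitfalls of such methods are well studied in vision, but corresponding qualitative and quantitative results are lacking for the text domain.
We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages.
Specifically, we train SAEs under a wide variety of hyperparameter settings on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG), and find that interpretable latents often emerge among the features the SAEs learn.
However, as in vision, we find that performance is highly sensitive to the inductive biases of the training pipeline.
Moreover, we show that latents correlating with certain features of the input do not always have a causal effect on the model's computation.
We therefore argue that causality must become a central target of SAE training: learning causal features should be incentivized from the ground up.
Motivated by this, we propose and carry out a preliminary investigation of an approach that promotes learning causally relevant features in our formal language setting.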

Abstract (original)

Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find performance turns out to be highly sensitive to inductive biases of the training pipeline. Moreover, we show latents correlating to certain features of the input do not always induce a causal impact on model’s computation. We thus argue that causality has to become a central target in SAE training: learning of causal features should be incentivized from the ground-up. Motivated by this, we propose and perform preliminary investigations for an approach that promotes learning of causally relevant features in our formal language setting.

arXiv information

Authors	Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana
Published	2024-10-15 16:42:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
