Tokenized SAEs: Disentangling SAE Reconstructions

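Summary

Sparse autoencoders (SAEs) have become a common tool for interpreting the inner workings of language models.
However, it is unclear how closely SAE features correspond to computationally important directions in the model.
This work shows empirically that many RES-JB SAE features correspond mainly to simple input statistics.
The authors hypothesize that this is caused by a large class imbalance in the training data combined with a lack of complex error signals.
To reduce this behavior, they propose a method that disentangles token reconstruction from feature reconstruction.
This is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction.
As a result, the SAE learns significantly more interesting features and achieves improved reconstruction in sparse regimes.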

Summary (original)

Sparse auto-encoders (SAEs) have become a prevalent tool for interpreting language models’ inner workings. However, it is unknown how tightly SAE features correspond to computationally important directions in the model. This work empirically shows that many RES-JB SAE features predominantly correspond to simple input statistics. We hypothesize this is caused by a large class imbalance in training data combined with a lack of complex error signals. To reduce this behavior, we propose a method that disentangles token reconstruction from feature reconstruction. This improvement is achieved by introducing a per-token bias, which provides an enhanced baseline for interesting reconstruction. As a result, significantly more interesting features and improved reconstruction in sparse regimes are learned.
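The per-token bias described in the abstract can be pictured as a lookup table that adds one learned vector per vocabulary entry to the SAE output. The following is a minimal NumPy sketch of a forward pass under that reading. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the ReLU encoder, the toy sizes, the random initialization, and the names `token_bias` and `tokenized_sae_forward` are hypothetical and are not taken from the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
vocab_size, d_model, d_sae = 50, 16, 64

# Ordinary SAE parameters (random values stand in for trained weights).
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))
b_dec = np.zeros(d_model)

# Assumed form of the per-token bias: one d_model vector per token id.
token_bias = rng.normal(scale=0.1, size=(vocab_size, d_model))


def tokenized_sae_forward(x, token_ids):
    """Reconstruct activations x (n, d_model) given their token ids (n,).

    The reconstruction is the sum of a token-dependent baseline and a
    sparse feature term, so the features only need to explain what the
    token identity alone does not.
    """
    baseline = token_bias[token_ids]                 # (n, d_model) lookup
    feats = np.maximum(0.0, x @ W_enc + b_enc)       # ReLU feature activations
    x_hat = feats @ W_dec + b_dec + baseline         # feature term + per-token term
    return x_hat, feats


# Usage: five activation vectors, two of which share token id 3.
x = rng.normal(size=(5, d_model))
token_ids = np.array([3, 3, 7, 0, 49])
x_hat, feats = tokenized_sae_forward(x, token_ids)
```

In this sketch, positions with the same token id receive the same baseline vector regardless of context, which is one way to read "disentangling token reconstruction from feature reconstruction".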

arXiv information

Authors	Thomas Dooms, Daniel Wilhelm
Published	2025-02-24 17:04:24+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
