Can sparse autoencoders be used to decompose and interpret steering vectors?

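Summary

Steering vectors are a promising approach to controlling the behaviour of large language models.
However, the mechanisms underlying them remain poorly understood.
Sparse autoencoders (SAEs) may offer a way to interpret steering vectors, but recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors.
This paper investigates why applying SAEs directly to steering vectors yields misleading decompositions and identifies two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate.
These limitations hinder the direct use of SAEs for interpreting steering vectors.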

Abstract (original)

Steering vectors are a promising approach to control the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a potential method to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
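
The second reason can be illustrated with a toy sketch. It assumes a hand-built two-feature "SAE" with orthonormal feature directions, a ReLU encoder with no bias, and a steering vector formed as the difference of two made-up activations (a common construction, assumed here). None of the names or numbers come from the paper; `FEATURES`, `encode`, and `decode` are hypothetical.

```python
# Toy illustration only: a hand-built 2-feature "SAE", not a trained model
# and not the paper's experimental setup.

def relu(x):
    return max(0.0, x)

# Hypothetical dictionary: each row is one unit-norm feature direction.
FEATURES = [
    [1.0, 0.0],
    [0.0, 1.0],
]

def encode(v):
    """Feature activations: ReLU of the projection onto each direction."""
    return [relu(sum(f_i * v_i for f_i, v_i in zip(f, v))) for f in FEATURES]

def decode(acts):
    """Reconstruction: activation-weighted sum of the feature directions."""
    dim = len(FEATURES[0])
    return [sum(a * f[i] for a, f in zip(acts, FEATURES)) for i in range(dim)]

# Two made-up model activations with non-negative projections on every
# feature, the kind of input this toy encoder reconstructs exactly.
pos_act = [2.0, 0.5]
neg_act = [0.5, 1.5]

# Assumed construction: steering vector = difference of the two activations.
steering = [p - n for p, n in zip(pos_act, neg_act)]

acts = encode(steering)   # the negative projection on feature 2 is clipped to 0
recon = decode(acts)      # so the reconstruction drops that component

print("steering vector:", steering)
print("SAE activations:", acts)
print("reconstruction: ", recon)
```

In this toy setting the ordinary activation `pos_act` is reconstructed exactly, while the steering vector loses its negative component along the second feature, because a ReLU encoder can only report non-negative feature activations.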

arXiv information

Authors	Harry Mayne, Yushi Yang, Adam Mahdi
Published	2024-11-13 17:16:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
