Improving Steering Vectors by Targeting Sparse Autoencoder Features

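Summary

Steering methods control the behavior of language models by trying to ensure that model outputs satisfy specific, predefined properties.
Adding steering vectors to a model is a promising approach to model control: it is easier than fine-tuning and may be more robust than prompting.
However, the effects of steering vectors produced by methods such as CAA [Panickssery et al., 2024] or by the direct use of SAE latents [Templeton et al., 2024] can be difficult to predict.
In our work, we address this problem by using SAEs to measure the effects of steering vectors, which gives us a method for understanding the causal effect of any steering vector intervention.
We use this measurement method to develop an improved steering technique, SAE-Targeted Steering (SAE-TS), which finds steering vectors that target specific SAE features while minimizing unintended side effects.
Evaluated on a range of tasks, SAE-TS overall balances steering effect and coherence better than CAA and SAE feature steering.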

Abstract (original)

To control the behavior of language models, steering methods attempt to ensure that outputs of the model satisfy specific pre-defined properties. Adding steering vectors to the model is a promising method of model control that is easier than finetuning, and may be more robust than prompting. However, it can be difficult to anticipate the effects of steering vectors produced by methods such as CAA [Panickssery et al., 2024] or the direct use of SAE latents [Templeton et al., 2024]. In our work, we address this issue by using SAEs to measure the effects of steering vectors, giving us a method that can be used to understand the causal effect of any steering vector intervention. We use this method for measuring causal effects to develop an improved steering method, SAE-Targeted Steering (SAE-TS), which finds steering vectors to target specific SAE features while minimizing unintended side effects. We show that overall, SAE-TS balances steering effects with coherence better than CAA and SAE feature steering, when evaluated on a range of tasks.
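
The ideas in the abstract can be made concrete with a small toy sketch. It is not the authors' implementation: it assumes a contrastive steering vector built as a difference of mean activations (the general idea behind CAA), a standard ReLU SAE encoder of the form `ReLU(W_enc x + b_enc)`, and a single hand-written activation rather than activations collected from a real model. All function names, weights, and numbers are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrastive_steering_vector(pos: List[Vector], neg: List[Vector]) -> Vector:
    """Difference of mean activations between two prompt sets (CAA-style idea)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mean_pos, mean_neg)]


def steer(activation: Vector, steering: Vector, scale: float = 1.0) -> Vector:
    """Add a scaled steering vector to an activation."""
    return [a + scale * s for a, s in zip(activation, steering)]


def sae_encode(activation: Vector, w_enc: List[Vector], b_enc: Vector) -> Vector:
    """Toy SAE encoder (assumed form): ReLU(W_enc @ x + b_enc)."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def feature_effect(
    activation: Vector,
    steering: Vector,
    w_enc: List[Vector],
    b_enc: Vector,
    scale: float = 1.0,
) -> Vector:
    """Change in SAE feature activations caused by adding the steering vector."""
    before = sae_encode(activation, w_enc, b_enc)
    after = sae_encode(steer(activation, steering, scale), w_enc, b_enc)
    return [a - b for a, b in zip(after, before)]


if __name__ == "__main__":
    # Hypothetical 2-dimensional activations for two contrasting prompt sets.
    pos = [[1.0, 0.0], [3.0, 0.0]]
    neg = [[0.0, 1.0], [0.0, 1.0]]
    vec = contrastive_steering_vector(pos, neg)

    # Hypothetical SAE with two features aligned to the coordinate axes.
    w_enc = [[1.0, 0.0], [0.0, 1.0]]
    b_enc = [0.0, 0.0]

    print("steering vector:", vec)
    print("feature effect:", feature_effect([0.5, 0.5], vec, w_enc, b_enc))
```

In this toy setting the steering vector raises the first feature and suppresses the second, which is the kind of side effect that becomes visible once an intervention is viewed in SAE feature space. A realistic measurement would aggregate such differences over many model activations instead of a single vector.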

arXiv information

Authors	Sviatoslav Chalnev, Matthew Siu, Arthur Conmy
Published	2024-11-21 12:10:54+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
