Understanding (Un)Reliability of Steering Vectors in Language Models

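Summary

Steering vectors are a lightweight way to control language model behavior by adding a learned bias to activations at inference time.
Steering has shown promising performance, but recent work shows that it can be unreliable or even counterproductive in some cases.
This paper studies how prompt type and the geometry of activation differences affect steering reliability.
First, all seven prompt types used in the experiments produce a net positive steering effect, but they show high variance across samples and often produce the opposite of the intended effect.
No prompt type clearly outperforms the others, yet the steering vectors derived from different prompt types often differ in direction (as measured by cosine similarity).
Second, higher cosine similarity between activation differences in the training set predicts more effective steering.
Finally, datasets in which positive and negative activations are better separated are more steerable.
The results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.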

Abstract (original)

Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite of the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
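
The two geometric ideas in the abstract, adding a vector to activations and measuring how well per-sample activation differences agree in direction, can be illustrated with a small sketch. The abstract does not say how the vectors are built, so this assumes a mean-difference construction over paired positive and negative activations, and it uses synthetic NumPy arrays in place of real model activations. All function and variable names here are hypothetical.

```python
import numpy as np


def steering_vector(pos_acts, neg_acts):
    """Mean of the per-pair activation differences (assumed construction)."""
    return (pos_acts - neg_acts).mean(axis=0)


def mean_cosine_to_vector(pos_acts, neg_acts):
    """Average cosine similarity between each pair's difference and the mean difference."""
    diffs = pos_acts - neg_acts
    vector = diffs.mean(axis=0)
    cosines = diffs @ vector / (
        np.linalg.norm(diffs, axis=1) * np.linalg.norm(vector)
    )
    return float(cosines.mean())


def steer(hidden, vector, scale=1.0):
    """Add the scaled steering vector to a hidden state at inference time."""
    return hidden + scale * vector


# Synthetic stand-ins for activations: 200 pairs in a 64-dimensional space.
rng = np.random.default_rng(0)
dim, n_pairs = 64, 200
direction = rng.normal(size=dim)
neg = rng.normal(size=(n_pairs, dim))

# Coherent case: every pair differs along roughly the same direction.
coherent_pos = neg + direction + 0.1 * rng.normal(size=(n_pairs, dim))
# Incoherent case: every pair differs along an unrelated random direction.
incoherent_pos = neg + rng.normal(size=(n_pairs, dim))

coherent_score = mean_cosine_to_vector(coherent_pos, neg)
incoherent_score = mean_cosine_to_vector(incoherent_pos, neg)
print(f"coherent: {coherent_score:.3f}, incoherent: {incoherent_score:.3f}")
```

In this toy setup the coherent dataset scores close to 1 and the incoherent one close to 0, which mirrors the abstract's claim only qualitatively; it is not a reproduction of the paper's experiments.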

arXiv information

Authors	Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov
Published	2025-05-28 17:53:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
