Interpretable Steering of Large Language Models with Feature Guided Activation Additions

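Summary

Effective and reliable control over the behavior of large language models (LLMs) remains a significant challenge.
Activation steering methods, which add steering vectors to a model's hidden states, are a promising approach, but existing techniques often lack precision and interpretability in how they influence model outputs.
We introduce Feature Guided Activation Additions (FGAA), a new activation steering method that builds on insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS).
By operating in the latent space of a Sparse Autoencoder (SAE) and using optimization techniques to select the desired SAE features, FGAA constructs precise steering vectors that steer more effectively while keeping the steered model's outputs coherent.
Evaluations on the Gemma-2-2B and Gemma-2-9B models across a range of steering tasks show that FGAA outperforms the existing steering methods CAA, SAE decoder steering, and SAE-TS.
The results also highlight an important trade-off between steering scale and general model capabilities, one that is consistent across all tested steering methods.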

Abstract (original)

Effective and reliable control over large language model (LLM) behavior is a significant challenge. While activation steering methods, which add steering vectors to a model’s hidden states, are a promising approach, existing techniques often lack precision and interpretability in how they influence model outputs. We introduce Feature Guided Activation Additions (FGAA), a novel activation steering method that leverages insights from Contrastive Activation Addition (CAA) and Sparse Autoencoder-Targeted Steering (SAE-TS). By operating in the latent space of a Sparse Autoencoder (SAE) and employing optimization techniques to select desired SAE features, FGAA constructs precise steering vectors that provide better steering effects while maintaining coherence of steered model outputs. In this regard, evaluations on Gemma-2-2B and Gemma-2-9B models across various steering tasks demonstrate that FGAA outperforms existing steering methods of CAA, SAE decoder steering, and SAE-TS. Our results also highlight important trade-offs between steering scale and general model capabilities that are consistent across all tested steering methods.
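The activation addition that the abstract refers to can be illustrated with a minimal sketch. It assumes the mean-difference construction commonly associated with CAA: average the activations collected on examples that show the target behavior, subtract the average from examples that do not, and add the scaled result to a hidden state. This is not the FGAA procedure, which selects features in SAE latent space. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def contrastive_steering_vector(positive_acts, negative_acts):
    """Difference of mean activations (assumed CAA-style construction)."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden_state, steering_vector, scale):
    """Add the scaled steering vector to a single hidden state."""
    return [h + scale * v for h, v in zip(hidden_state, steering_vector)]


# Toy 3-dimensional activations (made-up numbers, not from the paper).
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 0.0, 2.0], [0.0, 2.0, 2.0]]

vector = contrastive_steering_vector(positive_acts, negative_acts)
steered = steer([0.5, 0.5, 0.5], vector, scale=2.0)
```

The `scale` argument corresponds to the steering scale mentioned in the abstract: larger values push the hidden state further along the steering direction, which is where the reported trade-off with general model capabilities arises.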

arXiv information

Authors	Samuel Soo, Chen Guang, Wesley Teng, Chandrasekaran Balaganesh, Tan Guoxian, Yan Ming
Published	2025-04-02 13:20:40+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
