Fusion Steering: Prompt-Specific Activation Control

Summary

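This paper presents Fusion Steering, an activation steering method that improves the factual accuracy of large language models (LLMs) on question-answering (QA) tasks.
The approach introduces flexible steering configurations, including full-layer steering and segmented steering.
Unlike earlier methods, which are restricted to a single layer or a fixed set of layers, Fusion Steering dynamically injects prompt-specific activation deltas across all transformer layers.
The activation deltas are derived from reference completions that pair the ground-truth answer with a model-generated explanation, so that steering is semantically enriched and specific to each example.
Injection weights are optimized per prompt with Optuna against a joint objective that balances token overlap (factual alignment) and perplexity (a proxy for fluency).
Evaluation uses a composite score that combines token overlap with LLM-graded quality, covering factual accuracy, coherence, and relevance.
Empirical results on 260 SimpleQA prompts (selected from 500 on which the baseline failed) demonstrate the effectiveness of segmented steering.
With Gemma-2-2B-IT under 8-bit quantization, segmented steering reaches 25.4% accuracy (outputs scoring $\geq 0.6$), compared with 3.5% for the baseline and 16.2% for full-layer steering.
Under the stricter SimpleQA rubric, segmented steering raises the share of fully correct responses from 0.0% to 13.1%.
These findings highlight the strength of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control.
Fusion Steering is also compatible with sparse representations such as Neuronpedia or sparse crosscoders, which suggests a promising direction for interpretable and scalable activation-level control in LLMs.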
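
The injection step described in the summary can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the use of plain Python lists in place of real hidden states, the definition of a delta as reference activation minus baseline activation, and the choice of equal contiguous layer segments sharing one weight each. The summary does not specify how layers are grouped or how deltas are computed, so this is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def activation_deltas(reference: List[Vector], baseline: List[Vector]) -> List[Vector]:
    """Per-layer delta, assumed here to be reference minus baseline activation."""
    return [
        [r - b for r, b in zip(ref_layer, base_layer)]
        for ref_layer, base_layer in zip(reference, baseline)
    ]


def segment_weights(num_layers: int, segment_values: List[float]) -> List[float]:
    """Expand one weight per contiguous layer segment into one weight per layer.

    Assumption: segments are equal-sized, contiguous blocks of layers.
    A single-element list gives every layer the same weight (full-layer case).
    """
    n_seg = len(segment_values)
    return [segment_values[i * n_seg // num_layers] for i in range(num_layers)]


def steer(hidden: List[Vector], deltas: List[Vector], weights: List[float]) -> List[Vector]:
    """Add the weighted delta to the hidden state of every layer."""
    return [
        [h + w * d for h, d in zip(layer, delta)]
        for layer, delta, w in zip(hidden, deltas, weights)
    ]


# Toy example: 4 layers, hidden size 2, two segments with weights 0.0 and 0.5.
baseline = [[1.0, 1.0] for _ in range(4)]
reference = [[3.0, 5.0] for _ in range(4)]
deltas = activation_deltas(reference, baseline)
weights = segment_weights(4, [0.0, 0.5])
steered = steer(baseline, deltas, weights)
```

In this toy setup the first two layers are left unchanged and the last two move halfway toward the reference activations. In a per-prompt setting, the values passed as `segment_values` would be the quantities tuned by the optimizer.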

Summary (original)

We present Fusion Steering, an activation steering methodology that improves factual accuracy in large language models (LLMs) for question-answering (QA) tasks. This approach introduces flexible steering configurations, including full-layer steering and segmented steering. Unlike traditional methods constrained to single-layer or fixed-layer operations, Fusion Steering employs dynamic injection of prompt-specific activation deltas across all transformer layers. These activation deltas are derived from reference completions that combine the ground-truth answer with a model-generated explanation to facilitate semantically enriched, example-specific steering. The injection weights are optimized per prompt using Optuna, targeting a joint objective that balances token overlap (factual alignment) and perplexity (fluency proxy). Evaluation employs a composite score integrating token overlap and LLM-graded quality, encompassing factual accuracy, coherence, and relevance. Empirical results on 260 SimpleQA prompts (selected from 500 where the baseline failed) showcase the efficacy of segmented steering. Using Gemma-2-2B-IT with 8-bit quantization, segmented steering achieves an accuracy of 25.4% (outputs scoring $\geq 0.6$), outperforming the baseline at 3.5% and full-layer steering at 16.2%. Under the stricter SimpleQA rubric, segmented steering boosts fully correct responses from 0.0% to 13.1%. These findings highlight the strengths of segmented, dynamic intervention strategies and the promise of per-prompt, full-network activation control. Fusion Steering is also amenable to sparse representations, such as Neuronpedia or sparse crosscoders, suggesting a promising direction for interpretable and scalable activation-level control in LLMs.
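
The evaluation described in the abstract (a composite of token overlap and LLM-graded quality, with accuracy counted as the share of outputs scoring at least 0.6) can be sketched as follows. The abstract does not give the exact overlap definition or the mixing weights, so the set-based overlap, the equal weighting `alpha = 0.5`, and all function names below are hypothetical choices for illustration only.

```python
from typing import List


def token_overlap(output: str, reference: str) -> float:
    """Fraction of unique reference tokens that also appear in the output.

    Assumption: whitespace tokenization and case-insensitive matching.
    """
    ref = set(reference.lower().split())
    out = set(output.lower().split())
    return len(ref & out) / len(ref) if ref else 0.0


def composite_score(overlap: float, llm_grade: float, alpha: float = 0.5) -> float:
    """Weighted mix of token overlap and an LLM-assigned grade in [0, 1].

    Assumption: alpha = 0.5 is a placeholder, not a value from the paper.
    """
    return alpha * overlap + (1.0 - alpha) * llm_grade


def accuracy(scores: List[float], threshold: float = 0.6) -> float:
    """Share of outputs whose composite score meets the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


# Toy scores, not results from the paper.
example_scores = [0.6, 0.59, 0.9, 0.1]
example_accuracy = accuracy(example_scores)
```

Only the 0.6 threshold comes from the abstract; a score exactly at the threshold counts as correct because the abstract states the condition as $\geq 0.6$.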

arXiv information

Authors	Waldemar Chang, Alhassan Yasin
Published	2025-05-28 16:46:55+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
