Refusal in LLMs is an Affine Function

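Abstract

We propose affine concept editing (ACE), an approach for steering language model behavior by intervening directly in activations.
Starting from an affine decomposition of model activation vectors, we show that prior methods for steering model behavior correspond to subsets of the terms in this decomposition.
We then derive ACE and use it to control refusal behavior in ten different models, including Llama 3 70B.
ACE combines affine subspace projection with activation addition to reliably control the model's refusal responses across prompt types.
We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts.
Our experiments show that ACE consistently gives more precise control over model behavior than existing methods, and that it generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs.
Code for reproducing the results is available at https://github.com/EleutherAI/steering-llama3.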

Abstract (original)

We propose affine concept editing (ACE) as an approach for steering language models’ behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model’s refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at https://github.com/EleutherAI/steering-llama3 .
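The abstract describes ACE as a combination of affine subspace projection and activation addition but gives no formula. The sketch below is one plausible reading for a single concept direction, not the authors' implementation: it removes the activation's component along the direction as measured from a reference point, then adds the direction back at a chosen strength. The function name `affine_concept_edit` and the parameters `reference` and `alpha` are hypothetical, as is the idea that the reference point would be something like a mean activation on harmless prompts.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def affine_concept_edit(v, direction, reference, alpha):
    """Set the coordinate of `v` along `direction` to `alpha`.

    The coordinate is measured relative to `reference` (the affine part);
    with reference = 0 this reduces to linear ablation plus addition.
    All names here are illustrative assumptions, not the paper's API.
    """
    norm = math.sqrt(dot(direction, direction))
    r = [x / norm for x in direction]  # unit concept direction
    # Current coordinate along r, measured from the reference point.
    coeff = dot([a - b for a, b in zip(v, reference)], r)
    # Remove that coordinate (projection) and add alpha * r (addition).
    return [a + (alpha - coeff) * x for a, x in zip(v, r)]


# Toy 2-D example with made-up numbers.
v = [3.0, 4.0]
direction = [2.0, 0.0]
reference = [1.0, 1.0]
print(affine_concept_edit(v, direction, reference, alpha=0.0))  # [1.0, 4.0]
print(affine_concept_edit(v, direction, reference, alpha=0.5))  # [1.5, 4.0]
```

With `alpha=0.0` the edit is a pure affine projection: the result sits at the reference point's coordinate along the direction, and the orthogonal component is untouched. A nonzero `alpha` adds the direction back on top of that.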

arXiv information

Authors	Thomas Marshall, Adam Scherlis, Nora Belrose
Published	2025-01-28 03:59:40+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google