Interpreting and Steering LLMs with Mutual Information-based Explanations on Sparse Autoencoders

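Abstract

Large language models (LLMs) handle human queries well, but they occasionally produce flawed or unexpected responses.
Understanding their internal states is important for explaining their successes, diagnosing their failures, and improving their capabilities.
Sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, but little research has explored how to better explain SAE features, that is, how to understand the semantic meaning of the features an SAE learns.
Our theoretical analysis shows that existing explanation methods suffer from a frequency bias: they emphasize linguistic patterns over semantic concepts, even though the latter matter more for steering LLM behavior.
To address this, we propose using a fixed vocabulary set for feature interpretation and designing a mutual information-based objective, with the aim of better capturing the semantic meaning behind these features.
We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations.
Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behavior to defend against jailbreak attacks.
These findings highlight the value of explanations for steering LLM behavior in downstream applications.
We will release our code and data once the paper is accepted.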

Abstract (original)

Large language models (LLMs) excel at handling human queries, but they can occasionally generate flawed or unexpected responses. Understanding their internal states is crucial for understanding their successes, diagnosing their failures, and refining their capabilities. Although sparse autoencoders (SAEs) have shown promise for interpreting LLM internal representations, limited research has explored how to better explain SAE features, i.e., understanding the semantic meaning of features learned by SAE. Our theoretical analysis reveals that existing explanation methods suffer from the frequency bias issue, where they emphasize linguistic patterns over semantic concepts, while the latter is more critical to steer LLM behaviors. To address this, we propose using a fixed vocabulary set for feature interpretations and designing a mutual information-based objective, aiming to better capture the semantic meaning behind these features. We further propose two runtime steering strategies that adjust the learned feature activations based on their corresponding explanations. Empirical results show that, compared to baselines, our method provides more discourse-level explanations and effectively steers LLM behaviors to defend against jailbreak attacks. These findings highlight the value of explanations for steering LLM behaviors in downstream applications. We will release our code and data once accepted.
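
The abstract names a fixed vocabulary and a mutual information-based objective but gives no formula, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a binarized feature activation per text and scores each vocabulary word by the empirical mutual information between "the feature fired" and "the word is present". The function names (`mutual_information`, `rank_words`) and the toy texts are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) expressed with raw counts
        mi += p_xy * math.log2(p_xy * n * n / (px[x] * py[y]))
    return mi


def rank_words(feature_active, texts, vocabulary):
    """Rank words of a fixed vocabulary by MI with the feature's firing pattern."""
    scores = {}
    for word in vocabulary:
        present = [word in text.lower().split() for text in texts]
        scores[word] = mutual_information(feature_active, present)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (hypothetical): the feature fires on the texts that mention a lock.
texts = [
    "pick the lock",
    "bake the bread",
    "the lock is broken",
    "the bread is warm",
]
feature_active = [True, False, True, False]
vocabulary = ["lock", "the", "is"]

ranking = rank_words(feature_active, texts, vocabulary)
for word, score in ranking:
    print(f"{word}: {score:.3f} bits")
```

In this toy setting the word "the" appears in every text and therefore carries no information about whether the feature fires, while "lock" tracks it exactly. This is one way a mutual information score can discount frequent words, which is in the spirit of the frequency bias the abstract describes.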

arXiv information

Authors	Xuansheng Wu, Jiayi Yuan, Wenlin Yao, Xiaoming Zhai, Ninghao Liu
Published	2025-02-21 16:36:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
