Towards Best Practices of Activation Patching in Language Models: Metrics and Methods

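Summary

Mechanistic interpretability aims to understand the internal mechanisms of machine learning models, and localization, which identifies the important model components, is a key step.
Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020). However, the literature contains many variants, with little consensus on the choice of hyperparameters or methodology.
This work systematically examines the impact of methodological details in activation patching, including evaluation metrics and corruption methods.
In several settings of localization and circuit discovery in language models, varying these hyperparameters can lead to disparate interpretability results.
Backed by empirical observations, the authors give conceptual arguments for why certain metrics or methods may be preferred.
Finally, they provide recommendations for best practices of activation patching going forward.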

Summary (original)

Mechanistic interpretability seeks to understand the internal mechanisms of machine learning models, where localization — identifying the important model components — is a key step. Activation patching, also known as causal tracing or interchange intervention, is a standard technique for this task (Vig et al., 2020), but the literature contains many variants with little consensus on the choice of hyperparameters or methodology. In this work, we systematically examine the impact of methodological details in activation patching, including evaluation metrics and corruption methods. In several settings of localization and circuit discovery in language models, we find that varying these hyperparameters could lead to disparate interpretability results. Backed by empirical observations, we give conceptual arguments for why certain metrics or methods may be preferred. Finally, we provide recommendations for the best practices of activation patching going forwards.
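
As a rough illustration of the technique the abstract refers to, the sketch below applies activation patching to a tiny hand-written network instead of a language model. Everything in it is an assumption made for illustration: the weights `W1` and `W2`, the helper names `hidden`, `logit_diff`, and `patch_effects`, the choice of logit difference as the metric, and the choice of a second input as the corruption. It is not the authors' code or experimental setup. Each hidden unit's activation from the clean run is copied into the corrupted run, and the metric records how much of the clean behaviour that single patch restores.

```python
import math

# Hypothetical toy model: h = tanh(W1 x), logits = W2 h.
W1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 hidden units, 2 inputs
W2 = [[2.0, 0.0, 0.1], [0.0, 0.2, 0.1]]    # 2 logits, 3 hidden units


def hidden(x):
    """Hidden activations for input vector x."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]


def logit_diff(h):
    """Metric: logit of class 0 minus logit of class 1, given hidden activations."""
    out = [sum(w * hi for w, hi in zip(row, h)) for row in W2]
    return out[0] - out[1]


def patch_effects(clean_x, corrupt_x):
    """Patch each hidden unit from the clean run into the corrupted run.

    Returns, per unit, the fraction of the clean-vs-corrupted gap in the
    metric that the single patch restores (0 = no effect, 1 = full recovery).
    """
    h_clean = hidden(clean_x)
    h_corrupt = hidden(corrupt_x)
    ld_clean = logit_diff(h_clean)
    ld_corrupt = logit_diff(h_corrupt)
    effects = []
    for i in range(len(h_corrupt)):
        h_patched = list(h_corrupt)
        h_patched[i] = h_clean[i]  # the patch: overwrite one activation
        restored = logit_diff(h_patched) - ld_corrupt
        effects.append(restored / (ld_clean - ld_corrupt))
    return effects


if __name__ == "__main__":
    for unit, effect in enumerate(patch_effects([1.0, 0.0], [0.0, 1.0])):
        print(f"hidden unit {unit}: normalized effect {effect:.3f}")
```

In this toy setup, unit 0 accounts for most of the restored logit difference and unit 2 for none, because unit 2 takes the same value on both inputs. Replacing `logit_diff` with another metric, or building the corrupted run differently, changes the numbers each unit receives, which is the kind of sensitivity the abstract describes.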

arXiv information

Authors	Fred Zhang, Neel Nanda
Published	2024-01-17 04:07:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
