Scaling sparse feature circuit finding for in-context learning

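Summary

Sparse autoencoders (SAEs) are a popular tool for interpreting the activations of large language models, but their usefulness for addressing open questions in interpretability remains unclear.
In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL).
We identify abstract SAE features that (i) encode the model's knowledge of which task to execute and (ii) have latent vectors that causally induce the task zero-shot.
This is consistent with prior work showing that ICL is mediated by task vectors.
We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features.
To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, which has 30 times as many parameters, and to the more complex task of ICL.
Through circuit finding, we discover task-detecting features, whose corresponding SAE latents activate earlier in the prompt and detect when a task has been performed.
They are causally linked to the task-execution features through the attention and MLP sublayers.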
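
The idea of approximating a task vector by a sparse sum of SAE latents can be illustrated with a small toy sketch. This is not the procedure used in the paper: plain matching pursuit is swapped in as a generic sparse-approximation technique, random unit vectors stand in for real SAE decoder directions, and the "task vector" is synthetic. All names (`matching_pursuit`, `directions`, `task_vector`, the latent indices) are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(target, directions, k):
    """Greedily approximate `target` with k steps over unit-norm directions."""
    residual = list(target)
    weights = {}  # latent index -> accumulated coefficient
    for _ in range(k):
        scores = [dot(residual, d) for d in directions]
        # Pick the direction most correlated with what is still unexplained.
        best = max(range(len(directions)), key=lambda i: abs(scores[i]))
        weights[best] = weights.get(best, 0.0) + scores[best]
        residual = [r - scores[best] * x for r, x in zip(residual, directions[best])]
    return weights, residual


# Toy stand-ins (assumption): random unit vectors play the role of SAE decoder directions.
random.seed(0)
d_model, n_latents = 256, 1024
directions = []
for _ in range(n_latents):
    vec = [random.gauss(0.0, 1.0) for _ in range(d_model)]
    norm = math.sqrt(dot(vec, vec))
    directions.append([x / norm for x in vec])

# Synthetic "task vector": a weighted sum of three latents plus a little dense noise.
true_weights = {17: 3.0, 230: 2.0, 901: 1.5}
task_vector = [
    sum(w * directions[i][j] for i, w in true_weights.items()) + random.gauss(0.0, 0.01)
    for j in range(d_model)
]

weights, residual = matching_pursuit(task_vector, directions, k=3)
relative_error = math.sqrt(dot(residual, residual) / dot(task_vector, task_vector))
print(sorted(weights), round(relative_error, 3))
```

In this toy setting three latents out of 1024 account for most of the vector's norm, which is the sense in which a dense vector can be "well approximated by a sparse sum". With a real model, the directions would come from a trained SAE and the target from activations collected on few-shot prompts.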

Summary (original)

Sparse autoencoders (SAEs) are a popular tool for interpreting large language model activations, but their utility in addressing open questions in interpretability remains unclear. In this work, we demonstrate their effectiveness by using SAEs to deepen our understanding of the mechanism behind in-context learning (ICL). We identify abstract SAE features that (i) encode the model’s knowledge of which task to execute and (ii) whose latent vectors causally induce the task zero-shot. This aligns with prior work showing that ICL is mediated by task vectors. We further demonstrate that these task vectors are well approximated by a sparse sum of SAE latents, including these task-execution features. To explore the ICL mechanism, we adapt the sparse feature circuits methodology of Marks et al. (2024) to work for the much larger Gemma-1 2B model, with 30 times as many parameters, and to the more complex task of ICL. Through circuit finding, we discover task-detecting features with corresponding SAE latents that activate earlier in the prompt, that detect when tasks have been performed. They are causally linked with task-execution features through the attention and MLP sublayers.

arXiv information

Authors	Dmitrii Kharlapenko, Stepan Shabalin, Fazl Barez, Arthur Conmy, Neel Nanda
Published	2025-04-18 15:45:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
