Localizing Model Behavior with Path Patching

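Summary

Localizing a neural network's behavior to a subset of the network's components, or to a subset of the interactions between components, is a natural first step toward analyzing network mechanisms and possible failure modes.
Existing work is often qualitative and ad hoc, and there is no consensus on the appropriate way to evaluate localization claims.
We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses stating that a behavior is localized to a set of paths.
We refine an explanation of induction heads, characterize a behavior of GPT-2, and open-source a framework for efficiently running similar experiments.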

Summary (original)

Localizing behaviors of neural networks to a subset of the network’s components or a subset of interactions between components is a natural first step towards analyzing network mechanisms and possible failure modes. Existing work is often qualitative and ad-hoc, and there is no consensus on the appropriate way to evaluate localization claims. We introduce path patching, a technique for expressing and quantitatively testing a natural class of hypotheses expressing that behaviors are localized to a set of paths. We refine an explanation of induction heads, characterize a behavior of GPT-2, and open source a framework for efficiently running similar experiments.
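As a rough illustration of what testing a hypothesis that a behavior is "localized to a set of paths" can look like, the sketch below uses a toy two-branch model invented for this example. It is not the paper's framework, and the names `head_a`, `head_b`, `path_patch`, and `hypothesis_error` are hypothetical. The assumption made here is that paths named in the hypothesis receive the clean input, every other path receives a corrupted input, and a small change in output counts as support for the hypothesis; the paper's actual procedure and metrics may differ.

```python
# Toy illustration only: a two-branch "model" invented for this sketch,
# not the paper's framework and not GPT-2.

def head_a(x):
    # Branch A: depends strongly on the first input feature.
    return 2.0 * x[0]

def head_b(x):
    # Branch B: depends only weakly on the second input feature.
    return 0.01 * x[1]

def model(x_for_a, x_for_b):
    # The output sums two paths: input -> A -> output and input -> B -> output.
    # Taking a separate input per path lets each path be patched independently.
    return head_a(x_for_a) + head_b(x_for_b)

def path_patch(clean, corrupted, important_paths):
    # Paths named in the hypothesis see the clean input;
    # all other paths see the corrupted input.
    x_a = clean if "A" in important_paths else corrupted
    x_b = clean if "B" in important_paths else corrupted
    return model(x_a, x_b)

def hypothesis_error(clean, corrupted, important_paths):
    # Absolute gap between the unpatched output and the patched output.
    unpatched = model(clean, clean)
    return abs(unpatched - path_patch(clean, corrupted, important_paths))

clean = [1.0, 1.0]
corrupted = [-3.0, 5.0]

# Hypothesis 1: the behavior lives on the path through A.
err_a = hypothesis_error(clean, corrupted, {"A"})
# Hypothesis 2: the behavior lives on the path through B.
err_b = hypothesis_error(clean, corrupted, {"B"})
```

In this toy setup `err_a` is small and `err_b` is large, so only the hypothesis naming the path through A survives the test. A real experiment would presumably average such a gap over a dataset of clean and corrupted input pairs rather than a single pair.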

arXiv information

Authors	Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, Aryaman Arora
Published	2023-05-16 16:24:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
