InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques

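Summary

Mechanistic interpretability methods aim to identify the algorithm that a neural network implements, but such methods are hard to validate when the true algorithm is unknown.
This work introduces InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits, built for evaluating these techniques.
The networks are trained with Strict IIT (SIIT), a stricter version of Interchange Intervention Training (IIT).
Like the original IIT, SIIT trains a neural network so that its internal computation aligns with a desired high-level causal model, and it additionally prevents non-circuit nodes from affecting the model's output.
Evaluating SIIT on sparse transformers generated by the Tracr tool shows that SIIT models preserve Tracr's original circuit while being more realistic.
SIIT can also train transformers with larger circuits, such as Indirect Object Identification (IOI).
Finally, the benchmark is used to evaluate existing circuit discovery techniques.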

Abstract (original)

Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train these neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model’s output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr’s original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
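
The abstract describes two training signals: aligning a circuit node with a high-level causal variable via interchange interventions, and preventing non-circuit nodes from affecting the output. The following is a minimal sketch of that idea on a toy model in plain Python, not the authors' implementation. The functions `high_level`, `low_level`, `activations`, `iit_loss`, and `strict_loss`, the node names, and the `leak` parameter are all hypothetical names introduced for illustration. In particular, the choice to compare the strict intervention against the un-intervened high-level output is an assumption about how "preventing non-circuit nodes from affecting the output" could be expressed as a loss.

```python
from typing import Dict, Optional, Tuple

Input = Tuple[int, int, int]  # hypothetical toy input (a, b, c)


def high_level(x: Input, s_override: Optional[int] = None) -> float:
    """Toy high-level causal model: s = a + b, output = s * c."""
    a, b, c = x
    s = a + b if s_override is None else s_override
    return float(s * c)


def activations(x: Input) -> Dict[str, float]:
    """Hidden activations of the toy low-level model on a clean run."""
    a, b, _ = x
    return {"circuit": float(a + b), "non_circuit": float(a - b)}


def low_level(x: Input, patch: Optional[Dict[str, float]] = None,
              leak: float = 0.0) -> float:
    """Toy low-level model; `patch` overwrites hidden activations.

    With leak != 0 the output depends on the non-circuit node, but only
    when that node is intervened on, so clean runs stay correct.
    """
    a, b, c = x
    acts = activations(x)
    acts.update(patch or {})
    return acts["circuit"] * c + leak * (acts["non_circuit"] - (a - b))


def iit_loss(base: Input, source: Input, leak: float = 0.0) -> float:
    """Patch the circuit node from `source` into the `base` run and compare
    with the high-level model under the matching intervention on s."""
    patched = low_level(base, {"circuit": activations(source)["circuit"]}, leak)
    target = high_level(base, s_override=source[0] + source[1])
    return (patched - target) ** 2


def strict_loss(base: Input, source: Input, leak: float = 0.0) -> float:
    """Assumed strictness term: patch the NON-circuit node from `source`
    and require the output to match the un-intervened high-level output."""
    patched = low_level(
        base, {"non_circuit": activations(source)["non_circuit"]}, leak
    )
    return (patched - high_level(base)) ** 2


base, source = (1, 2, 3), (5, 1, 6)
print(iit_loss(base, source, leak=0.5))     # 0.0: circuit node is aligned
print(strict_loss(base, source, leak=0.5))  # 6.25: non-circuit node leaks
print(strict_loss(base, source, leak=0.0))  # 0.0: no leak, nothing to penalize
```

In this toy setting the leaky model gives correct clean outputs and has zero alignment loss on the circuit node, so only the strictness term exposes its dependence on the non-circuit node.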

arXiv information

Authors	Rohan Gupta,Iván Arcuschin,Thomas Kwa,Adrià Garriga-Alonso
Published	2024-07-19 17:46:51+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
