Successor Heads: Recurring, Interpretable Attention Heads In The Wild

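Summary

This work introduces successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days.
For example, a successor head increments 'Monday' to 'Tuesday'.
We explain successor head behavior using an approach rooted in mechanistic interpretability, the field that aims to explain in human-understandable terms how models complete tasks.
Existing work in this field has found interpretable language model components in small toy models.
However, results from toy models have not yet yielded insights that explain the internals of frontier models, and little is currently understood about the internal operations of large language models.
In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations common to different architectures.
They form in LLMs with as few as 31 million parameters and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2.
We find a set of 'mod-10 features' that underlie how successor heads increment in LLMs across different architectures and sizes.
We perform vector arithmetic with these features to edit head behavior and to provide insights into numeric representations within LLMs.
Additionally, we study the behavior of successor heads on natural language data and identify interpretable polysemanticity in a Pythia successor head.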

Abstract (original)

In this work we present successor heads: attention heads that increment tokens with a natural ordering, such as numbers, months, and days. For example, successor heads increment ‘Monday’ into ‘Tuesday’. We explain the successor head behavior with an approach rooted in mechanistic interpretability, the field that aims to explain how models complete tasks in human-understandable terms. Existing research in this area has found interpretable language model components in small toy models. However, results in toy models have not yet led to insights that explain the internals of frontier models and little is currently understood about the internal operations of large language models. In this paper, we analyze the behavior of successor heads in large language models (LLMs) and find that they implement abstract representations that are common to different architectures. They form in LLMs with as few as 31 million parameters, and at least as many as 12 billion parameters, such as GPT-2, Pythia, and Llama-2. We find a set of ‘mod-10 features’ that underlie how successor heads increment in LLMs across different architectures and sizes. We perform vector arithmetic with these features to edit head behavior and provide insights into numeric representations within LLMs. Additionally, we study the behavior of successor heads on natural language data, identifying interpretable polysemanticity in a Pythia successor head.
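The abstract does not describe how the 'mod-10 features' are found or applied, so the following is only a toy sketch of the intuition, not the authors' method. It assumes, purely for illustration, that a number token carries a one-hot feature for its last digit, that incrementing corresponds to a cyclic shift of that feature, and that editing corresponds to subtracting one digit's feature and adding another's. All function names are hypothetical.

```python
# Toy illustration only: hand-built one-hot "mod-10 features".
# These are NOT the learned features from the paper; names are hypothetical.

def mod10_feature(n: int) -> list[int]:
    """One-hot vector marking the last digit of n (stand-in for a learned feature)."""
    return [1 if i == n % 10 else 0 for i in range(10)]

def increment_feature(feature: list[int]) -> list[int]:
    """Cyclic shift by one position, so digit d maps to (d + 1) mod 10."""
    return feature[-1:] + feature[:-1]

def edit_feature(feature: list[int], old_digit: int, new_digit: int) -> list[int]:
    """Vector arithmetic: subtract one digit's feature and add another's."""
    old = mod10_feature(old_digit)
    new = mod10_feature(new_digit)
    return [f - o + n for f, o, n in zip(feature, old, new)]

def decode(feature: list[int]) -> int:
    """Read back the digit with the largest activation."""
    return max(range(10), key=lambda i: feature[i])

if __name__ == "__main__":
    print(decode(increment_feature(mod10_feature(13))))   # last digit 3 -> 4
    print(decode(increment_feature(mod10_feature(29))))   # last digit 9 wraps to 0
    print(decode(edit_feature(mod10_feature(13), 3, 7)))  # last digit edited 3 -> 7
```

In a real model the features would be directions in activation space rather than one-hot vectors, and the shift would be implemented by the head's learned weights; the sketch only shows why a mod-10 structure makes incrementing and editing simple linear operations.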

arXiv information

Authors	Rhys Gould, Euan Ong, George Ogden, Arthur Conmy
Published	2023-12-14 18:55:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
