Tracr-Injection: Distilling Algorithms into Pre-trained Language Models

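Summary

Motivated by the surge of large language models, researchers have pushed to formally characterize the symbolic abilities intrinsic to the transformer architecture.
A programming language called RASP has been proposed; its programs can be compiled directly into transformer weights that implement these algorithms.
However, tasks that can be implemented in RASP are rarely learned from natural unsupervised data, which reveals a mismatch between the theoretical capabilities of the transformer architecture and the practical learnability of those capabilities from unsupervised data.
The authors propose tracr-injection, a method for distilling algorithms written in RASP directly into a pre-trained language model.
They showcase the method by injecting three different algorithms into a language model.
They show that the method creates an interpretable subspace within the model's residual stream, which can be decoded into the variables present in the code of the RASP algorithm.
They also find that the method can improve out-of-distribution performance compared to their baseline, which indicates that a more symbolic mechanism is indeed at work inside the model.
The code used to run the experiments is released.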

Summary (original)

Motivated by the surge of large language models, there has been a push to formally characterize the symbolic abilities intrinsic to the transformer architecture. A programming language, called RASP, has been proposed, which can be directly compiled into transformer weights to implement these algorithms. However, the tasks that can be implemented in RASP are often uncommon to learn from natural unsupervised data, showing a mismatch between theoretical capabilities of the transformer architecture, and the practical learnability of these capabilities from unsupervised data. We propose tracr-injection, a method that allows us to distill algorithms written in RASP directly into a pre-trained language model. We showcase our method by injecting 3 different algorithms into a language model. We show how our method creates an interpretable subspace within the model’s residual stream, which can be decoded into the variables present in the code of the RASP algorithm. Additionally, we found that the proposed method can improve out-of-distribution performance compared to our baseline, indicating that indeed a more symbolic mechanism is taking place in the inner workings of the model. We release the code used to run our experiments.
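To make the idea of a RASP program more concrete, here is a minimal sketch of its two core primitives, select and aggregate, written in plain Python. It is a toy re-implementation for illustration only: it is not the tracr library's API and not the authors' code, and the names `select`, `aggregate`, and `frac_prevs` are assumptions chosen for this example. It computes, for each position, the fraction of tokens up to and including that position that equal a target token.

```python
from typing import Any, Callable, List, Sequence


def select(keys: Sequence[Any], queries: Sequence[Any],
           predicate: Callable[[Any, Any], bool]) -> List[List[bool]]:
    """Build a boolean attention pattern.

    Entry [q][k] is True when predicate(keys[k], queries[q]) holds.
    (Hypothetical helper mirroring the RASP 'select' idea.)
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: List[List[bool]], values: Sequence[float],
              default: float = 0.0) -> List[float]:
    """Average the selected values for each query position.

    This mimics uniform attention over the selected positions.
    (Hypothetical helper mirroring the RASP 'aggregate' idea.)
    """
    out = []
    for row in selector:
        picked = [v for keep, v in zip(row, values) if keep]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def frac_prevs(tokens: Sequence[str], target: str) -> List[float]:
    """Fraction of tokens at positions <= i that equal `target`."""
    indices = list(range(len(tokens)))
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    # Causal pattern: each query position attends to itself and earlier keys.
    causal = select(indices, indices, lambda k, q: k <= q)
    return aggregate(causal, is_target)


result = frac_prevs(list("abab"), "a")
```

In this sketch, each intermediate list (`is_target`, the output of `aggregate`) plays the role of a named program variable. The abstract's claim is that, after injection, variables of this kind can be decoded from a subspace of the model's residual stream.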

arXiv information

Authors	Tomás Vergara-Browne, Álvaro Soto
Published	2025-05-19 16:06:13+00:00
arXiv site	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
