Tracking the Feature Dynamics in LLM Training: A Mechanistic Study

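Abstract

Understanding training dynamics and feature evolution is important for the mechanistic interpretability of large language models (LLMs).
Sparse autoencoders (SAEs) have been used to identify features within LLMs, but it remains difficult to get a clear picture of how these features evolve during training.
In this study, we (1) introduce SAE-Track, a method for efficiently obtaining a continual series of SAEs;
(2) formulate the feature formation process and conduct a mechanistic analysis; and
(3) analyze and visualize feature drift during training.
Our work offers new insights into the dynamics of features in LLMs and deepens the understanding of training mechanisms and feature evolution.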

Abstract (original)

Understanding training dynamics and feature evolution is crucial for the mechanistic interpretability of large language models (LLMs). Although sparse autoencoders (SAEs) have been used to identify features within LLMs, a clear picture of how these features evolve during training remains elusive. In this study, we: (1) introduce SAE-Track, a method to efficiently obtain a continual series of SAEs; (2) formulate the process of feature formation and conduct a mechanistic analysis; and (3) analyze and visualize feature drift during training. Our work provides new insights into the dynamics of features in LLMs, enhancing our understanding of training mechanisms and feature evolution.

arXiv information

Authors	Yang Xu, Yi Wang, Hao Wang
Published	2024-12-23 14:58:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
