Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders

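Summary

The mechanisms behind the multilingual capabilities of large language models (LLMs) have been studied with neuron-based or internal-activation-based methods.
However, these methods often run into problems such as superposition and layer-wise activation variance, which limit their reliability.
Sparse autoencoders (SAEs) allow a finer-grained analysis by decomposing LLM activations into a sparse linear combination of SAE features.
We introduce a new metric for evaluating the monolinguality of features obtained from SAEs and find that some features are strongly tied to a specific language.
We further show that ablating these SAE features significantly reduces the LLM's ability in only one language, leaving the other languages almost unaffected.
Interestingly, some languages have multiple synergistic SAE features, and ablating them together has a larger effect than ablating them individually.
Finally, we use these SAE-derived language-specific features to enhance steering vectors and control the language that LLMs generate.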
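
The summary above names four operations: decomposing an activation into SAE features, scoring how language-specific a feature is, ablating a feature, and steering with a feature direction. The following is a minimal NumPy sketch of those ideas on toy data, not the authors' implementation. Everything in it is an assumption made for illustration: the SAE weights are random rather than trained, the `monolinguality` score (mean activation on one language minus the mean over the others) is only one plausible form of such a metric, and `ablate` and `steer` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 16  # toy sizes; real SAEs are far larger

# Hypothetical SAE parameters (random here; a real SAE would be trained).
W_enc = rng.normal(size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)


def encode(x):
    """Feature activations: ReLU of an affine map, so every entry is >= 0."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


def decode(f):
    """Reconstruction as a linear combination of the decoder rows."""
    return f @ W_dec + b_dec


def monolinguality(acts_by_lang, lang):
    """Assumed score: mean activation on `lang` minus the mean over other languages."""
    own = acts_by_lang[lang].mean(axis=0)
    others = np.mean(
        [a.mean(axis=0) for name, a in acts_by_lang.items() if name != lang],
        axis=0,
    )
    return own - others


def ablate(x, feature_ids):
    """Subtract the selected features' decoded contribution from activation x."""
    f = encode(x)
    return x - f[..., feature_ids] @ W_dec[feature_ids]


def steer(x, feature_id, alpha):
    """Push activation x along one feature direction with strength alpha."""
    return x + alpha * W_dec[feature_id]


# Toy "activations" per language (random stand-ins for real LLM activations).
acts_by_lang = {
    lang: encode(rng.normal(size=(100, d_model))) for lang in ("en", "ja", "fr")
}
scores = monolinguality(acts_by_lang, "ja")
top = int(np.argmax(scores))  # feature with the highest score for "ja"

x = rng.normal(size=(5, d_model))
x_ablated = ablate(x, [top])
x_steered = steer(x, top, alpha=2.0)
```

Because the inputs here are random, the selected feature carries no real linguistic meaning; the sketch only shows how the pieces fit together. Passing several indices to `ablate` removes multiple features at once, which is the setting in which the summary reports a synergistic effect.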

Abstract (original)

The mechanisms behind multilingual capabilities in Large Language Models (LLMs) have been examined using neuron-based or internal-activation-based methods. However, these methods often face challenges such as superposition and layer-wise activation variance, which limit their reliability. Sparse Autoencoders (SAEs) offer a more nuanced analysis by decomposing the activations of LLMs into sparse linear combination of SAE features. We introduce a novel metric to assess the monolinguality of features obtained from SAEs, discovering that some features are strongly related to specific languages. Additionally, we show that ablating these SAE features only significantly reduces abilities in one language of LLMs, leaving others almost unaffected. Interestingly, we find some languages have multiple synergistic SAE features, and ablating them together yields greater improvement than ablating individually. Moreover, we leverage these SAE-derived language-specific features to enhance steering vectors, achieving control over the language generated by LLMs.

arXiv information

Authors	Boyi Deng, Yu Wan, Yidan Zhang, Baosong Yang, Fuli Feng
Published	2025-05-08 10:24:44+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
