I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders

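Summary

Large language models (LLMs) have achieved remarkable success in natural language processing.
Recent advances have led to a new class of reasoning LLMs.
For example, the open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning.
Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored.
In this work, we employ Sparse Autoencoders (SAEs), a method that learns a sparse decomposition of a neural network's latent representations into interpretable features, to identify the features that drive reasoning in the DeepSeek-R1 series of models.
First, we propose an approach for extracting candidate "reasoning features" from SAE representations.
We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities.
Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs.
Code is available at https://github.com/AIRI-Institute/SAE-Reasoning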

Abstract (original)

Large Language Models (LLMs) have achieved remarkable success in natural language processing. Recent advances have led to the development of a new class of reasoning LLMs; for example, open-source DeepSeek-R1 has achieved state-of-the-art performance by integrating deep thinking and complex reasoning. Despite these impressive capabilities, the internal reasoning mechanisms of such models remain unexplored. In this work, we employ Sparse Autoencoders (SAEs), a method to learn a sparse decomposition of latent representations of a neural network into interpretable features, to identify features that drive reasoning in the DeepSeek-R1 series of models. First, we propose an approach to extract candidate "reasoning features" from SAE representations. We validate these features through empirical analysis and interpretability methods, demonstrating their direct correlation with the model's reasoning abilities. Crucially, we demonstrate that steering these features systematically enhances reasoning performance, offering the first mechanistic account of reasoning in LLMs. Code available at https://github.com/AIRI-Institute/SAE-Reasoning
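
The abstract names two mechanisms without showing them: decomposing an activation into sparse features, and steering a feature. The following is a minimal toy sketch of both, not the authors' implementation. It assumes a standard ReLU encoder with a linear decoder, and it assumes steering means adding a scaled decoder direction to the activation. All names, sizes, and the untrained random weights (`W_enc`, `W_dec`, `steer`, `alpha`) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes (assumption, not from the paper)

# Untrained, randomly initialised SAE parameters. A real SAE would be trained
# to reconstruct model activations under a sparsity penalty.
W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions
b_dec = np.zeros(d_model)

def sae_encode(x):
    """Map an activation vector to non-negative feature activations (ReLU)."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)

def sae_decode(f):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    return f @ W_dec + b_dec

def steer(x, feature_idx, alpha):
    """Shift the activation along one feature's decoder direction by alpha."""
    return x + alpha * W_dec[feature_idx]

x = rng.normal(size=d_model)          # stand-in for a model activation
features = sae_encode(x)              # feature activations, shape (n_features,)
x_hat = sae_decode(features)          # reconstruction, shape (d_model,)
x_steered = steer(x, feature_idx=3, alpha=2.0)
```

Because the decoder rows are normalised to unit length, `alpha` directly sets how far the activation moves along the chosen feature direction. Which features count as "reasoning features" and how strongly to steer them are questions the paper addresses and this sketch does not.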

arXiv information

Authors	Andrey Galichin, Alexey Dontsov, Polina Druzhinina, Anton Razzhigaev, Oleg Y. Rogov, Elena Tutubalina, Ivan Oseledets
Published	2025-03-24 16:54:26+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
