Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models

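Summary

Large language models (LLMs) excel at tasks that demand complex linguistic abilities, such as reference disambiguation and metaphor recognition and generation.
Despite these capabilities, the internal mechanisms by which LLMs process and represent linguistic knowledge remain largely opaque.
Earlier studies of linguistic mechanisms were limited by coarse granularity, insufficient causal analysis, and a narrow focus.
This study presents a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs).
It extracts a wide range of linguistic features across six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics.
These features are extracted, evaluated, and intervened on by constructing minimal contrast datasets and counterfactual sentence datasets.
Two indices, Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC), are introduced to measure how well linguistic features capture and control linguistic phenomena.
The results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs.
The work provides strong evidence that LLMs possess genuine linguistic knowledge and lays a foundation for more interpretable and controllable language modeling in future research.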

Abstract (original)

Large language models (LLMs) excel in tasks that require complex linguistic abilities, such as reference disambiguation and metaphor recognition/generation. Although LLMs possess impressive capabilities, their internal mechanisms for processing and representing linguistic knowledge remain largely opaque. Previous work on linguistic mechanisms has been limited by coarse granularity, insufficient causal analysis, and a narrow focus. In this study, we present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features from six dimensions: phonetics, phonology, morphology, syntax, semantics, and pragmatics. We extract, evaluate, and intervene on these features by constructing minimal contrast datasets and counterfactual sentence datasets. We introduce two indices, Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC), to measure the ability of linguistic features to capture and control linguistic phenomena. Our results reveal inherent representations of linguistic knowledge in LLMs and demonstrate the potential for controlling model outputs. This work provides strong evidence that LLMs possess genuine linguistic knowledge and lays the foundation for more interpretable and controllable language modeling in future research.
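
For readers unfamiliar with the general technique, the following is a minimal sketch of what a sparse auto-encoder and a feature intervention can look like. It is not the paper's implementation: the class `TinySAE`, the helpers `intervene` and `most_contrastive_feature`, the random weights, and the toy dimensions are all hypothetical, and the activation gap used here is only an illustrative stand-in, not the FRC or FIC index, whose definitions the abstract does not give.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinySAE:
    """Toy sparse auto-encoder (hypothetical, untrained).

    encode: h = ReLU(W_enc x + b_enc)
    decode: x_hat = W_dec h + b_dec
    A real SAE would be trained with a sparsity penalty on h.
    """

    def __init__(self, d_model, d_features, seed=0):
        rng = random.Random(seed)
        self.w_enc = [[rng.gauss(0, 0.5) for _ in range(d_model)]
                      for _ in range(d_features)]
        self.b_enc = [0.0] * d_features
        self.w_dec = [[rng.gauss(0, 0.5) for _ in range(d_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = matvec(self.w_enc, x)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, h):
        out = matvec(self.w_dec, h)
        return [o + b for o, b in zip(out, self.b_dec)]


def intervene(sae, x, feature_idx, new_value=0.0):
    """Overwrite one feature activation, then decode back to model space."""
    h = sae.encode(x)
    h[feature_idx] = new_value
    return sae.decode(h)


def most_contrastive_feature(sae, x_with, x_without):
    """Index of the feature whose activation differs most across a minimal pair."""
    gaps = [a - b for a, b in zip(sae.encode(x_with), sae.encode(x_without))]
    return max(range(len(gaps)), key=lambda i: gaps[i])


# Toy activations standing in for hidden states of a minimal contrast pair.
sae = TinySAE(d_model=4, d_features=8)
x_with = [1.0, -0.5, 0.3, 2.0]
x_without = [0.9, -0.5, 0.1, 0.2]
candidate = most_contrastive_feature(sae, x_with, x_without)
ablated = intervene(sae, x_with, candidate, new_value=0.0)
```

In this sketch, setting a feature to zero removes its decoder direction from the reconstruction, which is one simple way to test whether a feature has a causal effect on downstream output.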

arXiv information

Authors	Yi Jing, Zijun Yao, Lingxu Ran, Hongzhu Guo, Xiaozhi Wang, Lei Hou, Juanzi Li
Published	2025-02-27 18:16:47+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
