Towards Interpretability Without Sacrifice: Faithful Dense Layer Decomposition with Mixture of Decoders

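Abstract

Multilayer perceptrons (MLPs) are an integral part of large language models, but their dense representations make them difficult to understand, edit, and steer.
Recent methods learn interpretable approximations via neuron-level sparsity, but fail to faithfully reconstruct the original mapping.
This paper advocates moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation.
Under this paradigm, the authors introduce Mixture of Decoders (MxDs).
MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers.
Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights.
Experimentally, MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters.
Further evaluations on sparse probing and feature steering show that MxDs learn similarly specialized features of natural language.
The code is available at https://github.com/james-oldfield/MxD/.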

Abstract (original)

Multilayer perceptrons (MLPs) are an integral part of large language models, yet their dense representations render them difficult to understand, edit, and steer. Recent methods learn interpretable approximations via neuron-level sparsity, yet fail to faithfully reconstruct the original mapping–significantly increasing model’s next-token cross-entropy loss. In this paper, we advocate for moving to layer-level sparsity to overcome the accuracy trade-off in sparse layer approximation. Under this paradigm, we introduce Mixture of Decoders (MxDs). MxDs generalize MLPs and Gated Linear Units, expanding pre-trained dense layers into tens of thousands of specialized sublayers. Through a flexible form of tensor factorization, each sparsely activating MxD sublayer implements a linear transformation with full-rank weights–preserving the original decoders’ expressive capacity even under heavy sparsity. Experimentally, we show that MxDs significantly outperform state-of-the-art methods (e.g., Transcoders) on the sparsity-accuracy frontier in language models with up to 3B parameters. Further evaluations on sparse probing and feature steering demonstrate that MxDs learn similarly specialized features of natural language–opening up a promising new avenue for designing interpretable yet faithful decompositions. Our code is included at: https://github.com/james-oldfield/MxD/.
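The abstract does not spell out the tensor factorization, so the following is only a minimal NumPy sketch of one way a sparse mixture of linear decoders can give every sublayer a full-rank weight matrix while storing just two small matrices. It assumes a Hadamard-style parameterization in which sublayer n has weights `D * C[n]` (a shared decoder `D` with its columns rescaled by a per-sublayer vector). The names `mxd_forward`, `expert_weight`, `W_gate`, `W_enc`, `C`, `D`, and the ReLU plus top-k gating are illustrative assumptions, not the authors' API or necessarily their exact formulation.

```python
import numpy as np

def expert_weight(C, D, n):
    """Explicit (H, O) weight matrix of sublayer n under the assumed factorization.

    Scaling the columns of D by C[n] keeps the rank of D as long as C[n]
    has no zero entries, so each sublayer can be full rank.
    """
    return D * C[n][None, :]

def mxd_forward(x, W_gate, W_enc, C, D, k):
    """Sketch of a sparse mixture of linear decoders (assumed form).

    x:      (I,)   input vector
    W_gate: (I, N) produces one score per sublayer
    W_enc:  (I, H) shared encoder into the hidden space
    C:      (N, O) per-sublayer scaling vectors
    D:      (H, O) shared decoder
    k:      number of sublayers kept active
    """
    z = np.maximum(W_enc.T @ x, 0.0)        # shared hidden activations, (H,)
    scores = np.maximum(W_gate.T @ x, 0.0)  # non-negative sublayer scores, (N,)

    # Keep only the k largest scores (layer-level sparsity).
    a = np.zeros_like(scores)
    top = np.argsort(scores)[-k:]
    a[top] = scores[top]

    # sum_n a[n] * expert_weight(n).T @ z collapses to an elementwise product,
    # so the N explicit weight matrices never need to be materialized.
    y = (C.T @ a) * (D.T @ z)
    return y, a
```

Under this assumption the cost of a forward pass is two matrix-vector products and one elementwise product, regardless of how many sublayers exist; the explicit per-sublayer matrices are only needed when inspecting a single sublayer.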

arXiv information

Authors	James Oldfield,Shawn Im,Yixuan Li,Mihalis A. Nicolaou,Ioannis Patras,Grigorios G Chrysos
Published	2025-05-27 15:55:55+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
