Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Abstract

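Sparse autoencoders (SAEs) have become a powerful framework for machine learning interpretability: they decompose a model's representations, without supervision, into a dictionary of abstract, human-interpretable concepts.
However, we reveal a fundamental limitation: existing SAEs are severely unstable, since identical models trained on similar datasets can produce sharply different dictionaries, which undermines their reliability as an interpretability tool.
To address this, we draw on the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), in which dictionary atoms are constrained to the convex hull of the data.
This geometric anchoring significantly improves the stability of the inferred dictionaries, and a mildly relaxed variant, RA-SAE, additionally matches state-of-the-art reconstruction ability.
To rigorously assess the quality of the dictionaries learned by SAEs, we introduce two new benchmarks that test (i) plausibility, whether dictionaries recover "true" classification directions, and (ii) identifiability, whether dictionaries disentangle synthetic concept mixtures.
Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.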

Abstract (original)

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover ‘true’ classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
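The convex-hull constraint on dictionary atoms can be illustrated with a small NumPy sketch. It is not the authors' implementation: the softmax parameterization, the function names `row_softmax` and `archetypal_dictionary`, and the random data are all assumptions made for illustration. The only idea taken from the abstract is that each atom is a convex combination of data points; the relaxed RA-SAE variant is not shown.

```python
import numpy as np


def row_softmax(logits):
    """Map unconstrained logits to rows that are non-negative and sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def archetypal_dictionary(logits, data):
    """Build k atoms as convex combinations of n data points.

    logits: (k, n) free parameters (hypothetical parameterization)
    data:   (n, d) data points, e.g. model activations
    returns (k, d) dictionary whose rows lie in the convex hull of `data`
    """
    weights = row_softmax(logits)
    return weights @ data


# Illustrative random inputs, not real model activations.
rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))    # 50 points in 8 dimensions
logits = rng.normal(size=(4, 50))  # parameters for 4 atoms

W = row_softmax(logits)
D = archetypal_dictionary(logits, data)
```

Because every row of `W` is a set of convex weights, each atom in `D` stays inside the convex hull of `data` no matter what values the free parameters in `logits` take, which is one simple way to anchor a dictionary geometrically to the data.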

arXiv information

Authors	Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle
Published	2025-02-18 14:29:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
