From What to How: Attributing CLIP’s Latent Components Reveals Unexpected Semantic Reliance

Summary

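Transformer-based CLIP models are widely used for text-image probing and feature extraction, so it is important to understand the internal mechanisms behind their predictions.
Recent work shows that Sparse Autoencoders (SAEs) yield interpretable latent components, but it focuses on what these components encode and overlooks how they drive predictions.
The authors introduce a scalable framework that reveals what latent components activate for, how well they align with expected semantics, and how important they are to predictions.
To achieve this, they adapt attribution patching to produce instance-wise component attributions in CLIP, and they highlight key faithfulness limitations of the widely used Logit Lens technique.
Combining attributions with semantic alignment scores makes it possible to automatically uncover reliance on components that encode semantically unexpected or spurious concepts.
Applied across multiple CLIP variants, the method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography, and dataset artifacts.
Text embeddings remain prone to semantic ambiguity, but they are more robust to spurious correlations than linear classifiers trained on image embeddings.
A case study on skin lesion detection shows how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability.
Code is available at https://github.com/maxdreyer/attributing-clip.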

Summary (original)

Transformer-based CLIP models are widely used for text-image probing and feature extraction, making it relevant to understand the internal mechanisms behind their predictions. While recent works show that Sparse Autoencoders (SAEs) yield interpretable latent components, they focus on what these encode and miss how they drive predictions. We introduce a scalable framework that reveals what latent components activate for, how they align with expected semantics, and how important they are to predictions. To achieve this, we adapt attribution patching for instance-wise component attributions in CLIP and highlight key faithfulness limitations of the widely used Logit Lens technique. By combining attributions with semantic alignment scores, we can automatically uncover reliance on components that encode semantically unexpected or spurious concepts. Applied across multiple CLIP variants, our method uncovers hundreds of surprising components linked to polysemous words, compound nouns, visual typography and dataset artifacts. While text embeddings remain prone to semantic ambiguity, they are more robust to spurious correlations compared to linear classifiers trained on image embeddings. A case study on skin lesion detection highlights how such classifiers can amplify hidden shortcuts, underscoring the need for holistic, mechanistic interpretability. We provide code at https://github.com/maxdreyer/attributing-clip.
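
The contrast the abstract draws between attribution patching and the Logit Lens can be illustrated with a toy sketch. This is not the authors' implementation and does not use CLIP: it assumes a hypothetical two-step downstream path (a linear decode of latent activations followed by a tanh layer and a dot product with an output direction), and every name in it (`forward`, `grad_wrt_latents`, `D`, `W`, `t`) is invented for illustration. Attribution patching is shown in its usual first-order form, (activation minus baseline) times the gradient of the output, while the Logit-Lens-style readout projects each component straight onto the output direction and ignores the nonlinearity in between.

```python
import numpy as np

# Toy stand-in for "everything after the latent layer" (an assumption,
# not CLIP): z (k,) -> h = z @ D (d,) -> tanh(h @ W) (m,) -> dot with t.

def forward(z, D, W, t):
    """Map latent component activations to a scalar logit."""
    h = z @ D                      # decode components into the hidden state
    return float(np.tanh(h @ W) @ t)

def grad_wrt_latents(z, D, W, t):
    """Analytic gradient of the logit with respect to each component."""
    pre = (z @ D) @ W
    return D @ (W @ ((1.0 - np.tanh(pre) ** 2) * t))

def attribution_patching(z, z_base, D, W, t):
    """First-order estimate of each component's contribution relative
    to a baseline activation: (z - z_base) * d(logit)/dz."""
    return (z - z_base) * grad_wrt_latents(z, D, W, t)

def logit_lens(z, D, W, t):
    """Direct readout: project each component onto the output direction
    as if the downstream path were linear (the tanh is skipped)."""
    return z * (D @ (W @ t))

def ablation_effects(z, D, W, t):
    """Reference: true change in the logit when one component is zeroed."""
    full = forward(z, D, W, t)
    effects = np.empty_like(z)
    for i in range(z.size):
        z_abl = z.copy()
        z_abl[i] = 0.0
        effects[i] = full - forward(z_abl, D, W, t)
    return effects

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d, m = 6, 8, 4
    z = rng.uniform(0.0, 2.0, size=k)      # non-negative, SAE-like activations
    D = rng.normal(size=(k, d))
    W = rng.normal(size=(d, m))
    t = rng.normal(size=m)
    print("ablation   :", ablation_effects(z, D, W, t))
    print("attr. patch:", attribution_patching(z, np.zeros(k), D, W, t))
    print("logit lens :", logit_lens(z, D, W, t))
```

In this toy setting the two scores coincide only while the tanh stays in its near-linear regime; once it saturates, the direct readout can diverge from the gradient-based estimate. That is one simple way to picture a faithfulness gap, though the paper's analysis of the Logit Lens in CLIP may rest on different effects.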

arXiv information

Authors	Maximilian Dreyer, Lorenz Hufe, Jim Berend, Thomas Wiegand, Sebastian Lapuschkin, Wojciech Samek
Published	2025-05-26 17:08:02+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
