Interpreting the Second-Order Effects of Neurons in CLIP

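Summary

We interpret the function of individual neurons in CLIP by automatically describing them with text.
Analyzing either the direct effects (the flow from a neuron through the residual stream to the output) or the indirect effects (the overall contribution) fails to capture what neurons in CLIP do.
We therefore present the "second-order lens", which analyzes the effect flowing from a neuron through the later attention heads directly to the output.
These effects turn out to be highly selective: for each neuron, the effect is significant for fewer than 2% of the images.
Moreover, each effect can be approximated by a single direction in CLIP's text-image space.
We describe neurons by decomposing these directions into sparse sets of text representations.
The sets reveal polysemantic behavior: each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars).
Exploiting this polysemy, we mass-produce "semantic" adversarial examples by generating images that contain concepts spuriously correlated with the incorrect class.
Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images.
Our results indicate that a scalable understanding of neurons can be used both to deceive models and to introduce new model capabilities.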

Summary (original)

We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons’ function in CLIP. Therefore, we present the ‘second-order lens’, analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce 'semantic' adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
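The abstract says each neuron's direction is decomposed into a sparse set of text representations, but it does not say which algorithm is used. As a rough illustration only, the sketch below uses greedy orthogonal matching pursuit, a standard sparse-coding method chosen here as an assumption, not as the authors' procedure. The function name `sparse_decompose`, the toy one-hot "text embeddings", and the word labels are all hypothetical stand-ins for real CLIP text embeddings.

```python
import numpy as np


def sparse_decompose(direction, text_embeddings, k=3):
    """Greedy orthogonal matching pursuit (an assumed stand-in method).

    Picks up to k rows of `text_embeddings` whose linear combination
    best reconstructs `direction`, refitting coefficients at each step.
    Returns (chosen row indices, coefficients, final residual).
    """
    direction = np.asarray(direction, dtype=float)
    emb = np.asarray(text_embeddings, dtype=float)
    # Normalize each text embedding so correlation scores are comparable.
    atoms = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    residual = direction.copy()
    chosen = []
    coeffs = np.zeros(0)
    for _ in range(k):
        # Score every text embedding by its correlation with the residual.
        scores = np.abs(atoms @ residual)
        if chosen:
            scores[chosen] = -np.inf  # never select the same text twice
        chosen.append(int(np.argmax(scores)))
        # Refit all selected coefficients jointly by least squares.
        basis = atoms[chosen].T  # shape: (dim, number selected)
        coeffs, *_ = np.linalg.lstsq(basis, direction, rcond=None)
        residual = direction - basis @ coeffs
    return chosen, coeffs, residual


# Toy usage: one-hot vectors stand in for CLIP text embeddings (hypothetical).
labels = ["dog", "ship", "tree", "house", "car", "cloud"]
toy_embeddings = np.eye(8)[:6]
# A "polysemantic" direction that mixes two unrelated concepts.
neuron_direction = 2.0 * toy_embeddings[1] - 1.5 * toy_embeddings[4]

indices, weights, leftover = sparse_decompose(neuron_direction, toy_embeddings, k=2)
for i, w in zip(indices, weights):
    print(f"{labels[i]}: {w:+.2f}")
```

With real embeddings the selected texts would be read as a short description of the neuron; a small `k` keeps that description sparse at the cost of a larger residual.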

arXiv information

Authors	Yossi Gandelsman, Alexei A. Efros, Jacob Steinhardt
Published	2024-06-06 17:59:52+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
