Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

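Summary

Recent work has investigated how individual components of the CLIP-ViT model contribute to the final representation by exploiting CLIP's shared image-text representation space.
These components, such as attention heads and MLPs, have been shown to capture distinct image features such as shape, color, or texture.
However, understanding the role of these components in arbitrary vision transformers (ViTs) is difficult.
To address this, the authors introduce a general framework that can identify the roles of various components in ViTs beyond CLIP.
Specifically, the framework (a) automates the decomposition of the final representation into contributions from different model components, and (b) linearly maps these contributions to CLIP space so that they can be interpreted via text.
The authors also introduce a new scoring function that ranks components by their importance with respect to specific features.
Applying the framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT) yields insights into the roles of different components with respect to particular image features.
These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations.
The code to reproduce the experiments is released at https://github.com/SriramB-98/vit-decompose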

Abstract (original)

Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to various ViT variants (e.g. DeiT, DINO, DINOv2, Swin, MaxViT), we gain insights into the roles of different components concerning particular image features. These insights facilitate applications such as image retrieval using text descriptions or reference images, visualizing token importance heatmaps, and mitigating spurious correlations. We release our code to reproduce the experiments at https://github.com/SriramB-98/vit-decompose
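The two steps named in the abstract, decomposing the final representation into per-component contributions and linearly mapping those contributions into CLIP space, can be illustrated with a small NumPy sketch. Everything here is an assumption made for illustration: the data is synthetic, the linear map is fitted by ordinary least squares, and the variance-based score is a simple stand-in, not the scoring function proposed by the authors. The names `contributions`, `text_direction`, and `scores` are hypothetical and do not come from the released code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_components, d_vit, d_clip = 200, 6, 32, 16

# Synthetic per-component contributions: contributions[i, k] stands for what
# component k (e.g. an attention head or an MLP) adds to image i's final
# representation. In a real model these would come from the decomposition step.
contributions = rng.normal(size=(n_images, n_components, d_vit))
contributions[:, 3] *= 5.0  # synthetic: component 3 varies much more than the rest

# Assumed additive decomposition: the final representation is the sum of
# the component contributions.
final_rep = contributions.sum(axis=1)

# Synthetic "CLIP" image embeddings for the same images, generated from a
# hidden linear map plus a little noise (demo data only).
hidden_map = rng.normal(size=(d_vit, d_clip))
clip_emb = final_rep @ hidden_map + 0.01 * rng.normal(size=(n_images, d_clip))

# Fit a linear map W from the ViT representation space to CLIP space
# by ordinary least squares.
W, *_ = np.linalg.lstsq(final_rep, clip_emb, rcond=None)

# Because the map is linear (no bias term here), it can be applied to each
# contribution separately, and the mapped contributions still sum to the
# mapped final representation.
mapped = contributions @ W  # shape: (n_images, n_components, d_clip)

# Stand-in for a CLIP text embedding of a feature description such as "color".
text_direction = rng.normal(size=d_clip)
text_direction /= np.linalg.norm(text_direction)

# Illustrative score: variance over images of each component's projection
# onto the text direction. Higher means the component moves more along it.
scores = (mapped @ text_direction).var(axis=0)
ranking = np.argsort(-scores)

print("components ranked by illustrative score:", ranking)
```

The property this sketch relies on is linearity: since the map has no nonlinearity, interpreting each mapped contribution against a text embedding is consistent with interpreting the mapped final representation as a whole. A real pipeline would replace the synthetic arrays with contributions extracted from a ViT and with embeddings from an actual CLIP model.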

arXiv information

Authors	Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi
Published	2024-10-21 17:25:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
