Interpreting CLIP: Insights on the Robustness to ImageNet Distribution Shifts

Summary

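What distinguishes robust models from non-robust ones?
For ImageNet distribution shifts, such differences in robustness have been shown to trace back mainly to differences in training data, but it is not yet known how this affects what the model has learned.
This work bridges that gap by probing the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M, and DataComp).
These are compared with the representation spaces of less robust models that share the same backbones but differ in their (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet).
The analysis yields three new insights.
First, the authors detect outlier features in robust zero-shot CLIP vision encoders; to the best of their knowledge, this is the first time such features have been observed in non-language and non-transformer models.
Second, the presence of outlier features is found to indicate ImageNet shift robustness, since in this analysis they appear only in robust models.
Finally, the authors examine the number of unique concepts encoded in the representation space and find that zero-shot CLIP models encode more unique concepts.
However, they do not find this to be an indicator of ImageNet shift robustness and hypothesize that it is instead related to language supervision.
Because outlier features can be detected without access to any data from shifted datasets, the authors believe they could help practitioners gauge the distribution shift robustness of a pretrained model during deployment.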

Summary (original)

What distinguishes robust models from non-robust ones? While for ImageNet distribution shifts it has been shown that such differences in robustness can be traced back predominantly to differences in training data, so far it is not known what that translates to in terms of what the model has learned. In this work, we bridge this gap by probing the representation spaces of 16 robust zero-shot CLIP vision encoders with various backbones (ResNets and ViTs) and pretraining sets (OpenAI, LAION-400M, LAION-2B, YFCC15M, CC12M and DataComp), and comparing them to the representation spaces of less robust models with identical backbones, but different (pre)training sets or objectives (CLIP pretraining on ImageNet-Captions, and supervised training or finetuning on ImageNet). Through this analysis, we generate three novel insights. Firstly, we detect the presence of outlier features in robust zero-shot CLIP vision encoders, which to the best of our knowledge is the first time these are observed in non-language and non-transformer models. Secondly, we find the existence of outlier features to be an indication of ImageNet shift robustness in models, since we only find them in robust models in our analysis. Lastly, we also investigate the number of unique encoded concepts in the representation space and find zero-shot CLIP models to encode a higher number of unique concepts in their representation space. However, we do not find this to be an indicator of ImageNet shift robustness and hypothesize that it is rather related to the language supervision. Since the presence of outlier features can be detected without access to any data from shifted datasets, we believe that they could be a useful tool for practitioners to get a feeling for the distribution shift robustness of a pretrained model during deployment.
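
The abstract does not spell out how outlier features are detected, so the following is only a rough sketch of one plausible check, not the authors' procedure. It assumes you already have a matrix of encoder activations (one row per image, one column per feature) collected on in-distribution data. The function names, the kurtosis summary, and the `ratio` threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def feature_magnitudes(activations):
    """Mean absolute activation of each feature across all samples."""
    return np.abs(activations).mean(axis=0)

def magnitude_kurtosis(activations):
    """Kurtosis of the per-feature magnitude profile (assumed summary statistic).

    A few features that dominate the rest produce a heavy-tailed profile
    and therefore a large value.
    """
    mags = feature_magnitudes(activations)
    centered = mags - mags.mean()
    variance = (centered ** 2).mean()
    return float((centered ** 4).mean() / variance ** 2)

def find_outlier_features(activations, ratio=6.0):
    """Indices of features whose magnitude exceeds `ratio` times the median.

    The default ratio is an arbitrary illustrative value, not one from the paper.
    """
    mags = feature_magnitudes(activations)
    return np.flatnonzero(mags > ratio * np.median(mags))

# Synthetic stand-in for encoder activations: 512 samples, 64 features.
rng = np.random.default_rng(0)
baseline = rng.standard_normal((512, 64))
spiked = baseline.copy()
spiked[:, 7] *= 20.0  # inject one artificially dominant feature

outliers = find_outlier_features(spiked)
kurtosis_baseline = magnitude_kurtosis(baseline)
kurtosis_spiked = magnitude_kurtosis(spiked)
print("outlier feature indices:", outliers.tolist())
print("kurtosis without / with spike:", kurtosis_baseline, kurtosis_spiked)
```

Only in-distribution activations are needed here, which matches the abstract's point that no data from shifted datasets is required. With a real model, `spiked` would be replaced by activations extracted from the vision encoder.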

arXiv information

Authors	Jonathan Crabbé, Pau Rodríguez, Vaishaal Shankar, Luca Zappella, Arno Blaas
Published	2024-11-07 15:40:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
