Perceptual Grouping in Vision-Language Models

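Abstract

Recent progress in zero-shot image recognition suggests that vision-language models learn general visual representations rich in semantic information that can be probed arbitrarily with natural language phrases.
Understanding an image, however, means understanding not only what content it contains but, importantly, where that content is located.
In this work, we examine how well vision-language models understand where objects are located in an image and how well they group visually related parts of the image.
We show that contemporary vision and language representation learning models based on contrastive losses and large web-based data capture only limited object localization information.
We propose a minimal set of modifications that yields models that uniquely learn both semantic and spatial information.
We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentation, and robustness analyses.
We find that the resulting model achieves state-of-the-art results on unsupervised segmentation, and we show that the learned representations are uniquely robust to spurious correlations on datasets designed to probe the causal behavior of vision models.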

Abstract (original)

Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.

arXiv information

Authors	Kanchana Ranasinghe,Brandon McKinzie,Sachin Ravi,Yinfei Yang,Alexander Toshev,Jonathon Shlens
Published	2022-10-18 17:01:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
