Cross-Modal Concept Learning and Inference for Vision-Language Models

Summary

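Large-scale pre-trained vision-language models (VLMs) such as CLIP establish correlations between text and images and, with fine-tuning, achieve remarkable success on a variety of downstream tasks.
Existing fine-tuning methods match a class-specific text description against the whole image.
We recognize that this whole-image matching is not effective, because images of the same class often contain different sets of semantic objects, and each object in turn consists of a set of semantic parts or concepts.
Individual semantic parts or concepts may also appear in image samples from different classes.
To address this issue, this paper develops a new method called cross-modal concept learning and inference (CCLI).
Using CLIP's powerful text-image correlation capability, our method automatically learns a large set of distinctive visual concepts from images, guided by a set of semantic text concepts.
Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks such as few-shot learning and domain generalization.
Extensive experimental results show that our CCLI method improves on current state-of-the-art methods by large margins, for example by up to 8.0% on few-shot learning and by up to 1.3% on domain generalization.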

Abstract (original)

Large-scale pre-trained Vision-Language Models (VLMs), such as CLIP, establish the correlation between texts and images, achieving remarkable success on various downstream tasks with fine-tuning. In existing fine-tuning methods, the class-specific text description is matched against the whole image. We recognize that this whole image matching is not effective since images from the same class often contain a set of different semantic objects, and an object further consists of a set of semantic parts or concepts. Individual semantic parts or concepts may appear in image samples from different classes. To address this issue, in this paper, we develop a new method called cross-model concept learning and inference (CCLI). Using the powerful text-image correlation capability of CLIP, our method automatically learns a large set of distinctive visual concepts from images using a set of semantic text concepts. Based on these visual concepts, we construct a discriminative representation of images and learn a concept inference network to perform downstream image classification tasks, such as few-shot learning and domain generalization. Extensive experimental results demonstrate that our CCLI method is able to improve the performance upon the current state-of-the-art methods by large margins, for example, by up to 8.0% improvement on few-shot learning and by up to 1.3% for domain generalization.

arXiv information

Authors	Yi Zhang, Ce Zhang, Yushun Tang, Zhihai He
Published	2023-07-28 10:26:28+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
