SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification

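Abstract

Few-shot learning has advanced considerably, yet most existing few-shot image classification methods require supervised pre-training on a large number of base-class samples, which limits their generalization in real-world applications.
Recently, large-scale vision-language pre-trained models (VLPs) have attracted growing attention in few-shot learning because they offer a new paradigm for transferable visual representation learning using text that is readily available on the Web.
However, VLPs may neglect detailed visual information that is hard to describe in natural-language sentences but is important for learning an effective classifier that distinguishes different images.
To address this problem, we propose a new framework named Semantic-guided Visual Adapting (SgVA), which extends vision-language pre-trained models to produce discriminative adapted visual features by jointly using an implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss.
The implicit knowledge distillation is designed to transfer fine-grained cross-modal knowledge to guide the updating of the vision adapter.
State-of-the-art results on 13 datasets show that the adapted visual features complement the cross-modal features well and improve few-shot image classification.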

Abstract (original)

Although significant progress has been made in few-shot learning, most of existing few-shot image classification methods require supervised pre-training on a large amount of samples of base classes, which limits their generalization ability in real world application. Recently, large-scale Vision-Language Pre-trained models (VLPs) have been gaining increasing attention in few-shot learning because they can provide a new paradigm for transferable visual representation learning with easily available text on the Web. However, the VLPs may neglect detailed visual information that is difficult to describe by language sentences, but important for learning an effective classifier to distinguish different images. To address the above problem, we propose a new framework, named Semantic-guided Visual Adapting (SgVA), which can effectively extend vision-language pre-trained models to produce discriminative adapted visual features by comprehensively using an implicit knowledge distillation, a vision-specific contrastive loss, and a cross-modal contrastive loss. The implicit knowledge distillation is designed to transfer the fine-grained cross-modal knowledge to guide the updating of the vision adapter. State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
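The abstract names three training signals (a cross-modal contrastive loss, a vision-specific contrastive loss, and a knowledge distillation term) without giving their formulas. The sketch below shows one common way such terms are written, purely as an illustration. Everything in it is an assumption rather than the paper's formulation: the function names, the temperature `tau`, the weights `alpha` and `beta`, and in particular the distillation term, which is a standard KL divergence between softened logits standing in for the paper's implicit variant.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_contrastive(image_feat, text_feats, label, tau=0.1):
    """Cross-entropy over image-to-class-text cosine similarities (assumed form)."""
    probs = softmax([cosine(image_feat, t) / tau for t in text_feats])
    return -math.log(probs[label])

def vision_contrastive(feats, labels, tau=0.1):
    """Supervised contrastive loss over visual features only (assumed form)."""
    losses = []
    for i, fi in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        positives = [k for k, j in enumerate(others) if labels[j] == labels[i]]
        if not positives:
            continue  # anchor has no same-class partner in this batch
        probs = softmax([cosine(fi, feats[j]) / tau for j in others])
        # Average negative log-probability assigned to same-class samples.
        losses.append(-sum(math.log(probs[k]) for k in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def distillation(teacher_logits, student_logits, tau=1.0):
    """KL(teacher || student) on softened distributions (standard KD stand-in)."""
    p = softmax([x / tau for x in teacher_logits])
    q = softmax([x / tau for x in student_logits])
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_loss(l_cross, l_vision, l_kd, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are hypothetical weights."""
    return l_cross + alpha * l_vision + beta * l_kd
```

In this sketch the teacher logits would come from the cross-modal (image-to-text) similarities and the student logits from the adapted visual features, which is one plausible reading of "transferring cross-modal knowledge to guide the vision adapter", not a confirmed detail of the method.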

arXiv information

Authors	Fang Peng, Xiaoshan Yang, Linhui Xiao, Yaowei Wang, Changsheng Xu
Published	2023-01-20 13:56:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
