Active Learning for Vision-Language Models

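Summary

Pre-trained vision-language models (VLMs) such as CLIP have demonstrated strong zero-shot performance on a wide range of downstream computer vision tasks.
However, a considerable performance gap remains between these models and supervised deep models trained on the downstream dataset.
To close this gap, we propose a new active learning (AL) framework that improves the zero-shot classification performance of VLMs by selecting only a small number of informative samples from the unlabeled data for annotation during training.
To do so, our approach first calibrates the predicted entropy of the VLM, then combines self-uncertainty with neighbor-aware uncertainty to compute a reliable uncertainty measure for active sample selection.
Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets and significantly improves the zero-shot performance of VLMs.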

Summary (original)

Pre-trained vision-language models (VLMs) like CLIP have demonstrated impressive zero-shot performance on a wide range of downstream computer vision tasks. However, there still exists a considerable performance gap between these models and a supervised deep model trained on a downstream dataset. To bridge this gap, we propose a novel active learning (AL) framework that enhances the zero-shot classification performance of VLMs by selecting only a few informative samples from the unlabeled data for annotation during training. To achieve this, our approach first calibrates the predicted entropy of VLMs and then utilizes a combination of self-uncertainty and neighbor-aware uncertainty to calculate a reliable uncertainty measure for active sample selection. Our extensive experiments show that the proposed approach outperforms existing AL approaches on several image classification datasets, and significantly enhances the zero-shot performance of VLMs.
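
The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the entropy is calibrated or how the two uncertainties are combined. The sketch therefore assumes temperature scaling as a stand-in for calibration, a k-nearest-neighbor average of entropies (cosine similarity on image features) as the neighbor-aware term, and a weighted sum controlled by a hypothetical weight `alpha`. All function and parameter names (`select_for_annotation`, `temperature`, `k`, `alpha`) are assumptions introduced here.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Row-wise softmax; temperature scaling is an ASSUMED calibration stand-in."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities."""
    return -(probs * np.log(probs + eps)).sum(axis=1)


def select_for_annotation(logits, features, budget, k=5, temperature=1.0, alpha=0.5):
    """Return (indices of the `budget` highest-scoring samples, all scores).

    logits:   (n, num_classes) zero-shot scores, e.g. image-text similarities
    features: (n, d) image embeddings used to find neighbors
    The weighted-sum combination below is an assumption, not the paper's formula.
    """
    # Self-uncertainty: entropy of the (temperature-scaled) prediction.
    self_unc = entropy(softmax(logits, temperature))

    # Cosine similarity between all pairs of samples.
    f = np.asarray(features, dtype=float)
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-12)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor

    # Neighbor-aware uncertainty: mean entropy of the k most similar samples.
    k = min(k, len(f) - 1)
    if k < 1:
        neigh_unc = self_unc.copy()  # no neighbors available
    else:
        nn = np.argsort(-sim, axis=1)[:, :k]
        neigh_unc = self_unc[nn].mean(axis=1)

    score = alpha * self_unc + (1.0 - alpha) * neigh_unc
    return np.argsort(-score)[:budget], score
```

In an active learning loop, the returned indices would be sent to annotators, the model would be adapted on the newly labeled samples, and the scores would be recomputed on the remaining unlabeled pool in the next round.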

arXiv information

Authors	Bardia Safaei, Vishal M. Patel
Published	2024-10-29 16:25:50+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
