Tuning Vision-Language Models with Candidate Labels by Prompt Alignment

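Summary

Vision-language models (VLMs) can learn high-quality representations from large-scale training datasets of image-text pairs.
Prompt learning is a popular approach to fine-tuning VLMs so that they adapt to downstream tasks.
Despite its satisfying performance, a major limitation of prompt learning is its demand for labelled data.
In real-world scenarios, data privacy or sensitivity issues may mean that only a set of candidate labels (a set that contains the true label) is available, rather than the true label itself.
This paper provides the first study of prompt learning with candidate labels for VLMs.
The authors show empirically that prompt learning handles candidate labels better than other fine-tuning methods.
Nonetheless, its performance drops as label ambiguity increases.
To improve its robustness, they propose a simple yet effective framework that makes better use of the prior knowledge of VLMs to guide the learning process with candidate labels.
Specifically, the framework disambiguates candidate labels by aligning the model output with the mixed class posterior jointly predicted by both the learnable prompt and the handcrafted prompt.
In addition, the framework can be combined with various off-the-shelf training objectives for learning with candidate labels to further improve performance.
Extensive experiments demonstrate the effectiveness of the proposed framework.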
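The alignment step described above can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything specific here is an assumption made for illustration: the function names, the fixed mixing weight `alpha`, the restriction of the mixed posterior to the candidate set followed by renormalization, and the use of cross-entropy as the alignment loss. It is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_candidate_target(learnable_logits, handcrafted_logits,
                           candidate_mask, alpha=0.5):
    """Build a soft target from two prompts (hypothetical helper).

    Assumptions (not stated in the abstract):
    - the two posteriors are mixed with a fixed weight `alpha`;
    - mass on non-candidate classes is zeroed and the rest renormalized.
    """
    p_learn = softmax(learnable_logits)
    p_hand = softmax(handcrafted_logits)
    mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(p_learn, p_hand)]
    masked = [m * c for m, c in zip(mixed, candidate_mask)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("candidate_mask must select at least one class")
    return [x / total for x in masked]


def alignment_loss(learnable_logits, target):
    """Cross-entropy between the soft target and the learnable-prompt output.

    Cross-entropy is an assumed choice of alignment objective.
    """
    p = softmax(learnable_logits)
    return -sum(t * math.log(q) for t, q in zip(target, p) if t > 0.0)


# Toy example: 4 classes, the candidate set is {class 0, class 1}.
learnable_logits = [2.0, 1.0, 0.5, 0.1]
handcrafted_logits = [1.5, 2.5, 0.2, 0.3]
candidate_mask = [1, 1, 0, 0]

target = mixed_candidate_target(learnable_logits, handcrafted_logits,
                                candidate_mask)
loss = alignment_loss(learnable_logits, target)
```

In this sketch the target puts zero probability on classes outside the candidate set, so minimizing the loss pushes the learnable prompt's output toward the candidates that both prompts jointly consider likely. In a real training loop the target would normally be treated as a constant (no gradient) at each step; that detail is also an assumption.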

Abstract (original)

Vision-language models (VLMs) can learn high-quality representations from a large-scale training dataset of image-text pairs. Prompt learning is a popular approach to fine-tuning VLM to adapt them to downstream tasks. Despite the satisfying performance, a major limitation of prompt learning is the demand for labelled data. In real-world scenarios, we may only obtain candidate labels (where the true label is included) instead of the true labels due to data privacy or sensitivity issues. In this paper, we provide the first study on prompt learning with candidate labels for VLMs. We empirically demonstrate that prompt learning is more advantageous than other fine-tuning methods, for handling candidate labels. Nonetheless, its performance drops when the label ambiguity increases. In order to improve its robustness, we propose a simple yet effective framework that better leverages the prior knowledge of VLMs to guide the learning process with candidate labels. Specifically, our framework disambiguates candidate labels by aligning the model output with the mixed class posterior jointly predicted by both the learnable and the handcrafted prompt. Besides, our framework can be equipped with various off-the-shelf training objectives for learning with candidate labels to further improve their performance. Extensive experiments demonstrate the effectiveness of our proposed framework.

arXiv information

Authors	Zhifang Zhang, Beibei Li
Published	2024-07-11 04:46:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
