Exploring Vision-Language Models for Imbalanced Learning

Abstract

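Vision-language models (VLMs) trained with contrastive language-image pre-training have shown promising zero-shot classification performance.
However, they perform relatively poorly on imbalanced datasets, where the class distribution of the training data is skewed and predictions for minority classes suffer as a result.
For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset.
We propose adding a lightweight decoder to VLMs, both to avoid the out-of-memory (OOM) problem caused by a large number of classes and to capture nuanced features of tail classes.
We then explore improving VLMs through prompt tuning, fine-tuning, and the incorporation of imbalanced learning algorithms such as Focal Loss, Balanced SoftMax, and Distribution Alignment.
Experiments demonstrate that VLM performance can be boosted further when VLMs are combined with the decoder and imbalanced learning methods.
Specifically, our improved VLMs significantly outperform zero-shot classification, by 6.58%, 69.82%, and 6.17% in average accuracy on ImageNet-LT, iNaturalist18, and Places-LT, respectively.
We further analyze the influence of pre-training data size, backbones, and training cost.
Our study highlights the significance of imbalanced learning algorithms even for VLMs pre-trained on huge amounts of data.
The code is released at https://github.com/Imbalance-VLM/Imbalance-VLM.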
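
To make two of the imbalanced learning algorithms named in the abstract more concrete, the following is a minimal sketch of Focal Loss and Balanced SoftMax for a single sample, written in plain Python. It uses the commonly cited formulations of these losses and is not taken from the authors' repository; the function names, the example logits, and the class counts are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def cross_entropy(logits, target):
    """Standard cross-entropy loss for one sample."""
    return -log_softmax(logits)[target]


def focal_loss(logits, target, gamma=2.0):
    """Focal Loss: scales cross-entropy by (1 - p_t)^gamma so that
    well-classified samples contribute less to the total loss."""
    log_pt = log_softmax(logits)[target]
    pt = math.exp(log_pt)
    return -((1.0 - pt) ** gamma) * log_pt


def balanced_softmax_loss(logits, target, class_counts):
    """Balanced SoftMax: adds the log of each class's training count to its
    logit before the softmax, which compensates for the label prior."""
    adjusted = [z + math.log(n) for z, n in zip(logits, class_counts)]
    return cross_entropy(adjusted, target)


if __name__ == "__main__":
    # Hypothetical logits and long-tailed class counts (head, medium, tail).
    logits = [2.0, 0.5, 0.1]
    counts = [1000, 50, 5]
    for target in range(len(logits)):
        print(
            target,
            round(cross_entropy(logits, target), 4),
            round(focal_loss(logits, target), 4),
            round(balanced_softmax_loss(logits, target, counts), 4),
        )
```

In this sketch, Balanced SoftMax assigns a larger training loss to a tail-class sample than plain cross-entropy does for the same logits, which pushes the model to raise tail-class logits. At inference time the count-based adjustment is typically dropped and the raw logits are used.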

Abstract (original)

Vision-Language models (VLMs) that use contrastive language-image pre-training have shown promising zero-shot classification performance. However, their performance on imbalanced dataset is relatively poor, where the distribution of classes in the training dataset is skewed, leading to poor performance in predicting minority classes. For instance, CLIP achieved only 5% accuracy on the iNaturalist18 dataset. We propose to add a lightweight decoder to VLMs to avoid OOM (out of memory) problem caused by large number of classes and capture nuanced features for tail classes. Then, we explore improvements of VLMs using prompt tuning, fine-tuning, and incorporating imbalanced algorithms such as Focal Loss, Balanced SoftMax and Distribution Alignment. Experiments demonstrate that the performance of VLMs can be further boosted when used with decoder and imbalanced methods. Specifically, our improved VLMs significantly outperforms zero-shot classification by an average accuracy of 6.58%, 69.82%, and 6.17%, on ImageNet-LT, iNaturalist18, and Places-LT, respectively. We further analyze the influence of pre-training data size, backbones, and training cost. Our study highlights the significance of imbalanced learning algorithms in face of VLMs pre-trained by huge data. We release our code at https://github.com/Imbalance-VLM/Imbalance-VLM.

arXiv information

Authors	Yidong Wang, Zhuohao Yu, Jindong Wang, Qiang Heng, Hao Chen, Wei Ye, Rui Xie, Xing Xie, Shikun Zhang
Published	2023-06-21 15:44:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
