Understanding and Mitigating Overfitting in Prompt Tuning for Vision-Language Models

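Summary

Pretrained vision-language models (VLMs) such as CLIP show impressive generalization on downstream vision tasks when given appropriate text prompts.
Instead of designing prompts by hand, the recently proposed Context Optimization (CoOp) learns continuous prompts from task-specific training data.
Although CoOp improves downstream task performance, several studies report that it overfits in two respects: (i) test accuracy on base classes first improves and then degrades during training; (ii) test accuracy on novel classes keeps decreasing.
However, none of the existing studies explains or mitigates this overfitting.
This study first explores the cause of overfitting by analyzing the gradient flow.
Comparative experiments show that CoOp favors generalizable features in the early training stage and spurious features in the later stage, which leads to the non-overfitting and overfitting phenomena, respectively.
Based on these observations, the authors propose Subspace Prompt Tuning (SubPT), which, throughout training, projects the back-propagated gradients onto the low-rank subspace spanned by the eigenvectors of the early-stage gradient flow, and which successfully eliminates the overfitting problem.
In addition, CoOp is equipped with a Novel Feature Learner (NFL) that strengthens the ability of the learned prompts to generalize to novel categories beyond the training set, without requiring image training data.
Extensive experiments on 11 classification datasets show that SubPT+NFL consistently improves the performance of CoOp and outperforms the state-of-the-art CoCoOp approach.
Experiments on more challenging downstream vision tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also confirm the effectiveness of the proposed method.
Code is available at https://tinyurl.com/mpe64f89.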

Summary (original)

Pretrained vision-language models (VLMs) such as CLIP have shown impressive generalization capability in downstream vision tasks with appropriate text prompts. Instead of designing prompts manually, Context Optimization (CoOp) has been recently proposed to learn continuous prompts using task-specific training data. Despite the performance improvements on downstream tasks, several studies have reported that CoOp suffers from the overfitting issue in two aspects: (i) the test accuracy on base classes first improves and then worsens during training; (ii) the test accuracy on novel classes keeps decreasing. However, none of the existing studies can understand and mitigate such overfitting problems. In this study, we first explore the cause of overfitting by analyzing the gradient flow. Comparative experiments reveal that CoOp favors generalizable and spurious features in the early and later training stages, respectively, leading to the non-overfitting and overfitting phenomena. Given those observations, we propose Subspace Prompt Tuning (SubPT) to project the gradients in back-propagation onto the low-rank subspace spanned by the early-stage gradient flow eigenvectors during the entire training process and successfully eliminate the overfitting problem. In addition, we equip CoOp with a Novel Feature Learner (NFL) to enhance the generalization ability of the learned prompts onto novel categories beyond the training set, needless of image training data. Extensive experiments on 11 classification datasets demonstrate that SubPT+NFL consistently boost the performance of CoOp and outperform the state-of-the-art CoCoOp approach. Experiments on more challenging vision downstream tasks, including open-vocabulary object detection and zero-shot semantic segmentation, also verify the effectiveness of the proposed method. Codes can be found at https://tinyurl.com/mpe64f89.
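
The gradient projection described for SubPT can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the "early-stage gradient flow eigenvectors" are computed, so the sketch assumes they are the top eigenvectors of the Gram matrix of stacked early-stage prompt gradients, obtained here through an SVD. The function names `early_stage_basis`, `project_gradient`, and `projected_sgd_step`, the plain SGD update, and the learning rate are all hypothetical choices made for illustration.

```python
import numpy as np


def early_stage_basis(early_grads: np.ndarray, rank: int) -> np.ndarray:
    """Return a (d, rank) orthonormal basis of the dominant gradient subspace.

    early_grads has shape (T, d): one flattened prompt gradient per early
    training step. Assumption: the subspace is spanned by the top
    eigenvectors of G^T G, which are the right singular vectors of G.
    """
    _, _, vt = np.linalg.svd(early_grads, full_matrices=False)
    return vt[:rank].T


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project a flattened gradient onto span(basis)."""
    return basis @ (basis.T @ grad)


def projected_sgd_step(prompt: np.ndarray, grad: np.ndarray,
                       basis: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """One plain SGD update that only moves the prompt inside the subspace."""
    return prompt - lr * project_gradient(grad, basis)
```

In use, one would record the flattened prompt gradients for the first few training steps, call `early_stage_basis` once, and then pass every later gradient through `projected_sgd_step`, so that gradient components outside the early-stage subspace never reach the prompt.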

arXiv information

Authors	Chengcheng Ma, Yang Liu, Jiankang Deng, Lingxi Xie, Weiming Dong, Changsheng Xu
Published	2023-02-14 14:01:38+00:00

Source and services used

arxiv.jp, Google
