Language-Aware Soft Prompting for Vision & Language Foundation Models

Abstract

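This paper concerns soft prompt learning for Vision & Language (V&L) models. Like NLP models, V&L models can be adapted to downstream tasks by learning soft continuous prompts from a small number of training examples. Current methods learn the soft prompts by minimizing a cross-entropy loss, using as class weights the features obtained by passing the prompts and class names through the text encoder. Such methods, however, overfit the training data substantially, so accuracy drops sharply when they are tested on unseen classes from the same domain. The main contribution of this paper is a surprisingly simple approach to mitigating this problem: a second cross-entropy loss is used to minimize the distance between the learned soft prompts and a set of manual prompts (obtained by prompt engineering). The proposed loss can be interpreted in several ways, including as a regularizer, as language-based augmentation, and as a way of learning more discriminative class centroids. Importantly, the formulation can inherently include virtual classes during training, that is, class names for which no visual samples are available, which further improves the robustness of the learned prompts. Extensive evaluation on 11 datasets shows that the approach (a) significantly outperforms all prior work on soft prompting and (b) for the first time matches and surpasses, on the majority of the test datasets, the novel-class accuracy obtained by hand-crafted prompts and CLIP. Code will be made available.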

Abstract (original)

This paper is on soft prompt learning for Vision & Language (V&L) models. Similarly to their NLP counterparts, V&L models can be adapted to a downstream task by learning soft continuous prompts using a few training examples. Current methods learn the soft prompts by minimizing a cross-entropy loss using as class weights the features obtained by passing the prompts plus the class names through the text encoder. Such methods, however, significantly overfit the training data suffering from large accuracy degradation when tested on unseen classes from the same domain. Our main contribution, in this paper, is a surprisingly simple approach to alleviate this problem: we use a second cross entropy loss to minimize the distance between the learned soft prompts and a set of hand-engineered manual prompts (obtained by prompt engineering). The proposed loss can be interpreted in multiple ways including as a regularizer, as a means for language-based augmentation, and as a way of learning more discriminative class centroids. Importantly, our formulation is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through extensive evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for the majority of the test datasets. Code will be made available.
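The second cross-entropy loss described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the version below assumes that text features are already computed as plain vectors, that similarity is cosine similarity scaled by a temperature, and that each manual-prompt feature is classified against the learned-prompt class weights. The names `cosine`, `text_to_text_loss`, and `temperature` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_to_text_loss(learned, manual, temperature=0.07):
    """Mean cross-entropy of manual-prompt features against learned class weights.

    learned: list of C vectors, one text feature per class, assumed to come
             from the learned soft prompt plus the class name.
    manual:  list of C lists; manual[c] holds the text features of the
             hand-written prompts for class c.
    Assumption: this is one plausible reading of the abstract's
    "second cross entropy loss", not the paper's exact definition.
    """
    total, count = 0.0, 0
    for true_class, prompt_features in enumerate(manual):
        for feature in prompt_features:
            logits = [cosine(feature, w) / temperature for w in learned]
            # Numerically stable log-sum-exp for the softmax normalizer.
            peak = max(logits)
            log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
            total += log_norm - logits[true_class]
            count += 1
    return total / count


# Toy 2-D features for two classes, one manual prompt each.
manual_features = [[[0.9, 0.1]], [[0.1, 0.9]]]
aligned = [[1.0, 0.0], [0.0, 1.0]]      # learned weights near the manual prompts
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # learned weights swapped between classes

loss_aligned = text_to_text_loss(aligned, manual_features)
loss_misaligned = text_to_text_loss(misaligned, manual_features)
```

In this toy setup the aligned weights give a much smaller loss than the swapped ones, which is the regularizing pull toward the manual prompts that the abstract describes. Because the loss needs only text features, a class name with no images (a virtual class in the abstract's terms) can be added simply as another entry in `learned` and `manual`.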

arXiv information

Authors	Adrian Bulat, Georgios Tzimiropoulos
Published	2022-10-03 17:56:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
