Variational prompt tuning improves generalization of vision-language models

Abstract

Prompt tuning provides an efficient mechanism to adapt large vision-language models to downstream tasks by treating part of the input language prompts as learnable parameters while freezing the rest of the model. Existing prompt tuning methods, however, tend to damage the generalization capabilities of the foundation models, because the learned prompts lack the capacity to cover certain concepts within the language model. To avoid this limitation, we propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks. The resulting algorithm relies on a simple yet powerful variational framework that can be directly integrated with other developments. We show that our approach integrates seamlessly into both standard and conditional prompt learning frameworks, improving performance considerably in both cases, especially with regard to preserving the generalization capability of the original model. Our method provides the current state of the art for prompt learning, surpassing CoCoOp by 1.6% average Top-1 accuracy on the standard benchmark. Remarkably, it even surpasses the original CLIP model in terms of generalization to new classes. Implementation code will be released.
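
The abstract describes deriving prompts through stochastic sampling within a variational framework, but it does not state how the prompt distribution is parameterized. As a rough illustration only, the sketch below assumes a diagonal Gaussian over prompt embeddings with a standard normal prior and uses the reparameterization trick. The names `sample_prompt`, `kl_to_standard_normal`, `mu`, and `log_var`, and all numeric values, are hypothetical and are not taken from the paper.

```python
import math
import random


def sample_prompt(mu, log_var, rng):
    """Draw one prompt embedding p = mu + sigma * eps, with eps ~ N(0, 1).

    Assumption: a diagonal Gaussian over prompt embeddings; the abstract
    does not specify the actual distribution used by the authors.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions.

    Assumption: a standard normal prior, chosen here only for illustration.
    """
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0, 0.0, 2.0]           # hypothetical mean prompt embedding
log_var = [-2.0, -2.0, -2.0, -2.0]   # hypothetical per-dimension log-variance

# Several stochastic prompts drawn around the same mean embedding.
samples = [sample_prompt(mu, log_var, rng) for _ in range(5)]

# Regularizer that would typically be added to the task loss in a
# variational objective.
kl = kl_to_standard_normal(mu, log_var)
```

In a setup like this, each sampled embedding would be fed to the frozen text encoder in place of a single fixed learned prompt, and the KL term would keep the learned distribution from collapsing or drifting far from the prior.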

arXiv information

Authors	Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G. M. Snoek, Georgios Tzimiropoulos, Brais Martinez
Published	2022-10-05 17:05:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
