Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models

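Summary

Constructing effective prompts has drawn significant interest for downstream applications of vision-language pre-trained models.
Existing work on prompt engineering either requires laborious manual design or treats prompt tuning as a point estimation problem; both can fail to describe the diverse characteristics of a category, which limits their applicability.
We introduce a Bayesian probabilistic approach to prompt learning in which label-specific stochastic prompts are generated hierarchically: a latent vector is first sampled from an underlying distribution and then passed through a lightweight generative model.
Importantly, we semantically regularize prompt learning with visual knowledge, viewing an image and its corresponding prompts as sets of patches and tokens under optimal transport. This pushes the prompt tokens to faithfully capture label-specific visual concepts rather than overfit the training categories.
The proposed model also extends straightforwardly to the conditional case, in which instance-conditional prompts are generated to improve generalizability.
Extensive experiments on 15 datasets show promising transferability and generalization performance for the proposed model.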
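
The hierarchical generation step described in the summary (sample a latent vector, then map it to prompt tokens with a lightweight generator) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the Gaussian latent with reparameterized sampling, the single tanh layer standing in for the "lightweight generative model", the function names `sample_latent`, `linear`, and `generate_prompt`, and all dimensions and parameter values are hypothetical.

```python
import math
import random


def sample_latent(mu, log_sigma, rng):
    """Reparameterized draw z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]


def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def generate_prompt(z, weight, bias, n_tokens, token_dim):
    """Map a latent vector to n_tokens prompt-token embeddings of size token_dim."""
    flat = [math.tanh(h) for h in linear(z, weight, bias)]
    return [flat[i * token_dim:(i + 1) * token_dim] for i in range(n_tokens)]


rng = random.Random(0)
latent_dim, n_tokens, token_dim = 4, 3, 2

# Hypothetical per-label parameters; in a real model these would be learned.
mu = [0.0] * latent_dim
log_sigma = [-1.0] * latent_dim
weight = [[rng.uniform(-0.5, 0.5) for _ in range(latent_dim)]
          for _ in range(n_tokens * token_dim)]
bias = [0.0] * (n_tokens * token_dim)

# Two draws for the same label give two different prompts.
prompt_a = generate_prompt(sample_latent(mu, log_sigma, rng), weight, bias, n_tokens, token_dim)
prompt_b = generate_prompt(sample_latent(mu, log_sigma, rng), weight, bias, n_tokens, token_dim)
```

Because each call draws a fresh latent vector, a single label yields a distribution over prompts rather than one fixed prompt, which is the contrast with point estimation that the summary draws.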

Summary (original)

For downstream applications of vision-language pre-trained models, there has been significant interest in constructing effective prompts. Existing works on prompt engineering, which either require laborious manual designs or optimize the prompt tuning as a point estimation problem, may fail to describe diverse characteristics of categories and limit their applications. We introduce a Bayesian probabilistic resolution to prompt learning, where the label-specific stochastic prompts are generated hierarchically by first sampling a latent vector from an underlying distribution and then employing a lightweight generative model. Importantly, we semantically regularize prompt learning with the visual knowledge and view images and the corresponding prompts as patch and token sets under optimal transport, which pushes the prompt tokens to faithfully capture the label-specific visual concepts, instead of overfitting the training categories. Moreover, the proposed model can also be straightforwardly extended to the conditional case where the instance-conditional prompts are generated to improve the generalizability. Extensive experiments on 15 datasets show promising transferability and generalization performance of our proposed model.
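
To make the patch-token alignment idea more concrete, the sketch below computes an entropy-regularized optimal transport cost between a set of image patch embeddings and a set of prompt token embeddings using Sinkhorn iterations. It is a minimal illustration, not the paper's method: the abstract does not specify the cost function, the marginals, or the solver, so the cosine cost, the uniform marginals, the Sinkhorn solver, the values of `eps` and `n_iters`, and the toy embeddings are all assumptions.

```python
import math


def cosine_cost(patches, tokens):
    """Cost matrix C[i][j] = 1 - cosine similarity of patch i and token j."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))
    return [[1.0 - sum(a * b for a, b in zip(p, t)) / (norm(p) * norm(t))
             for t in tokens] for p in patches]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropic OT with uniform marginals; returns (transport plan, transport cost)."""
    n, m = len(cost), len(cost[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m  # assumed uniform mass on patches and tokens
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total


# Toy embeddings: three image patches and two prompt tokens (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tokens_aligned = [[1.0, 0.0], [0.0, 1.0]]
tokens_misaligned = [[-1.0, 0.0], [0.0, -1.0]]

plan, cost_aligned = sinkhorn(cosine_cost(patches, tokens_aligned))
_, cost_misaligned = sinkhorn(cosine_cost(patches, tokens_misaligned))
```

In this toy setup the tokens that point in the same directions as the patches receive a lower transport cost than the tokens that point away from them, so minimizing such a cost would pull prompt tokens toward the visual content of the image.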

arXiv information

Authors	Xinyang Liu,Dongsheng Wang,Miaoge Li,Zhibin Duan,Yishi Xu,Bo Chen,Mingyuan Zhou
Published	2023-03-16 06:09:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
