Unified Vision and Language Prompt Learning

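Summary

Since the emergence of large vision-language models such as CLIP, prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model's input space, has become a trend in the vision community.
We present a systematic study of two representative prompt tuning methods: text prompt tuning and visual prompt tuning.
The main finding is that neither unimodal prompt tuning method performs consistently well: text prompt tuning fails on data with high intra-class visual variance, while visual prompt tuning cannot handle data with low inter-class variance.
To combine the strengths of both, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across the different modalities.
Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than its unimodal counterparts on few-shot learning benchmarks and on domain generalization benchmarks.
Code and models will be released to facilitate future research.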
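
The idea of one small network producing prompts for both modalities can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the helper names (`make_unified_prompts`, `tiny_network`, `split_prompts`), the residual linear map standing in for the tiny network, and the even split between text and visual prompts are not taken from the summary, which does not specify the architecture. No training loop is shown; the point is only that both prompt sets are derived from the same shared parameters, so a loss computed through either encoder would update the same values.

```python
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def make_unified_prompts(n_prompts, dim, seed=0):
    """Randomly initialise one shared set of prompt vectors (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(n_prompts)]


def tiny_network(prompts, weight):
    """Stand-in for the tiny network: a residual linear map (an assumption,
    not the architecture used by the authors)."""
    projected = matmul(prompts, weight)
    return [[p + q for p, q in zip(row, proj)] for row, proj in zip(prompts, projected)]


def split_prompts(prompts, n_text):
    """Send the first n_text vectors to the text side and the rest to the visual side."""
    return prompts[:n_text], prompts[n_text:]


dim = 4
unified = make_unified_prompts(n_prompts=6, dim=dim)

# Identity matrix used as a placeholder for the learnable weight.
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

transformed = tiny_network(unified, weight)
text_prompts, visual_prompts = split_prompts(transformed, n_text=3)

print(len(text_prompts), len(visual_prompts))
```

In a real setup the text prompts would be prepended to the token embeddings of the class names and the visual prompts to the image patch embeddings, with the pretrained encoders kept frozen; those steps depend on the specific model and are left out of this sketch.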

Summary (original)

Prompt tuning, a parameter- and data-efficient transfer learning paradigm that tunes only a small number of parameters in a model’s input space, has become a trend in the vision community since the emergence of large vision-language models like CLIP. We present a systematic study on two representative prompt tuning methods, namely text prompt tuning and visual prompt tuning. A major finding is that none of the unimodal prompt tuning methods performs consistently well: text prompt tuning fails on data with high intra-class visual variances while visual prompt tuning cannot handle low inter-class variances. To combine the best from both worlds, we propose a simple approach called Unified Prompt Tuning (UPT), which essentially learns a tiny neural network to jointly optimize prompts across different modalities. Extensive experiments on over 11 vision datasets show that UPT achieves a better trade-off than the unimodal counterparts on few-shot learning benchmarks, as well as on domain generalization benchmarks. Code and models will be released to facilitate future research.

arXiv information

Authors	Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, Chen Change Loy
Published	2022-10-13 17:50:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
