Learning to Prompt for Vision-Language Models

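Abstract

Large pre-trained vision-language models such as CLIP have shown great potential for learning representations that transfer across a wide range of downstream tasks. Unlike traditional representation learning based on discretized labels, vision-language pre-training aligns images and text in a common feature space, which enables zero-shot transfer to downstream tasks via prompting: classification weights are synthesized from natural language describing the classes of interest. In this work, we show that a major challenge in deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach for adapting CLIP-like vision-language models to downstream image recognition. Concretely, CoOp models a prompt's context words with learnable vectors while keeping all pre-trained parameters fixed. CoOp comes in two implementations, unified context and class-specific context, so that it can handle different image recognition tasks. Through extensive experiments on 11 datasets, we demonstrate that CoOp beats hand-crafted prompts by a decent margin with as few as one or two shots, and that its improvement over prompt engineering grows with more shots (for example, with 16 shots the average gain is around 15%, with the highest exceeding 45%). Despite being a learning-based approach, CoOp achieves superior domain generalization performance compared with the zero-shot model using hand-crafted prompts.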

Abstract (original)

Large pre-trained vision-language models like CLIP have shown great potential in learning representations that are transferable across a wide range of downstream tasks. Different from the traditional representation learning that is based mostly on discretized labels, vision-language pre-training aligns images and texts in a common feature space, which allows zero-shot transfer to a downstream task via prompting, i.e., classification weights are synthesized from natural language describing classes of interest. In this work, we show that a major challenge for deploying such models in practice is prompt engineering, which requires domain expertise and is extremely time-consuming — one needs to spend a significant amount of time on words tuning since a slight change in wording could have a huge impact on performance. Inspired by recent advances in prompt learning research in natural language processing (NLP), we propose Context Optimization (CoOp), a simple approach specifically for adapting CLIP-like vision-language models for downstream image recognition. Concretely, CoOp models a prompt’s context words with learnable vectors while the entire pre-trained parameters are kept fixed. To handle different image recognition tasks, we provide two implementations of CoOp: unified context and class-specific context. Through extensive experiments on 11 datasets, we demonstrate that CoOp requires as few as one or two shots to beat hand-crafted prompts with a decent margin and is able to gain significant improvements over prompt engineering with more shots, e.g., with 16 shots the average gain is around 15% (with the highest reaching over 45%). Despite being a learning-based approach, CoOp achieves superb domain generalization performance compared with the zero-shot model using hand-crafted prompts.
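The mechanism described in the abstract (learnable context vectors placed in front of a class token, with everything pre-trained kept frozen) can be illustrated with a small toy sketch. This is not the authors' implementation: the text encoder, the embeddings, the embedding width, the number of context tokens, and the temperature are all hypothetical stand-ins chosen so that the example runs with the Python standard library alone. It shows only the forward pass and the difference between unified and class-specific context. In an actual setup, the context vectors would presumably be trained with a classification loss while the encoders stay fixed.

```python
import math
import random

random.seed(0)

DIM = 8                      # toy embedding width (assumption)
N_CTX = 4                    # number of learnable context tokens (assumption)
CLASSES = ["cat", "dog", "car"]


def rand_vec(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# Frozen pieces: stand-ins for pre-trained class-name embeddings and text encoder weights.
CLASS_EMB = {name: rand_vec(DIM) for name in CLASSES}
W_TEXT = [rand_vec(DIM) for _ in range(DIM)]


def toy_text_encoder(tokens):
    """Frozen stand-in encoder: mean-pool the tokens, then apply a fixed linear map."""
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(DIM)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in W_TEXT]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_context(class_specific):
    """Context vectors: one shared set (unified) or an independent set per class."""
    if class_specific:
        return {name: [rand_vec(DIM) for _ in range(N_CTX)] for name in CLASSES}
    shared = [rand_vec(DIM) for _ in range(N_CTX)]
    return {name: shared for name in CLASSES}


def class_probabilities(image_feature, context, temperature=0.01):
    """Softmax over cosine similarities between the image and each class prompt."""
    logits = []
    for name in CLASSES:
        prompt = context[name] + [CLASS_EMB[name]]   # [ctx_1, ..., ctx_M, CLASS]
        text_feature = toy_text_encoder(prompt)      # acts as the classification weight
        logits.append(cosine(image_feature, text_feature) / temperature)
    peak = max(logits)                               # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


image_feature = rand_vec(DIM)                        # stand-in for a frozen image encoder output
unified = make_context(class_specific=False)
per_class = make_context(class_specific=True)
probs_unified = class_probabilities(image_feature, unified)
probs_per_class = class_probabilities(image_feature, per_class)
```

In this sketch the only values that would be updated during training are the vectors returned by `make_context`. The unified variant shares one set of context vectors across all classes, while the class-specific variant keeps a separate set for each class.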

arXiv information

Authors	Kaiyang Zhou, Jingkang Yang, Chen Change Loy, Ziwei Liu
Published	2022-10-06 11:36:09+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
