Generalizable Prompt Tuning for Vision-Language Models

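Abstract

Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to build image-text pairs for a specific downstream task.
Hand-crafted or template-based prompts generally apply to a wider range of unseen classes, but they tend to perform poorly on the downstream task itself (i.e., on seen classes).
Learnable soft prompts, by contrast, often perform well on the downstream task but generalize poorly.
In addition, prior work has concentrated mainly on the textual modality, and very few studies have explored how the visual modality could help prompts generalize.
With these limitations in mind, the authors investigate how to perform prompt tuning so that it achieves both competitive downstream performance and generalization.
The study shows that treating the soft prompt and the hand-crafted prompt as dual views of the textual modality, and maximizing their mutual information, yields a better ensemble of task-specific and general semantic information.
Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, which results in significant robustness to a wider range of unseen classes.
Extensive evaluations on several benchmarks report that the proposed approach achieves competitive results in both task-specific performance and general ability.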
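
The abstract does not say which mutual-information estimator is used, so the following is only a minimal sketch under assumptions. It uses InfoNCE, a common contrastive lower bound on mutual information, applied to two sets of per-class text features: one from a soft prompt and one from a hand-crafted prompt. The function names (`info_nce`, `mi_lower_bound`), the temperature value, and the toy feature vectors are all hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def info_nce(soft_feats: Sequence[Vector],
             hand_feats: Sequence[Vector],
             temperature: float = 0.07) -> float:
    """Mean InfoNCE loss over classes (assumed estimator, not the paper's).

    Row i of soft_feats and row i of hand_feats describe the same class and
    form the positive pair; every other row of hand_feats is a negative.
    """
    if len(soft_feats) != len(hand_feats) or len(soft_feats) == 0:
        raise ValueError("both views need the same non-zero number of classes")
    soft = [l2_normalize(v) for v in soft_feats]
    hand = [l2_normalize(v) for v in hand_feats]
    total = 0.0
    for i, s in enumerate(soft):
        logits = [dot(s, h) / temperature for h in hand]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(soft)


def mi_lower_bound(soft_feats: Sequence[Vector],
                   hand_feats: Sequence[Vector],
                   temperature: float = 0.07) -> float:
    """InfoNCE bound: I(soft; hand) >= log(N) - loss, with N classes."""
    loss = info_nce(soft_feats, hand_feats, temperature)
    return math.log(len(soft_feats)) - loss


if __name__ == "__main__":
    # Toy 3-class example with made-up features.
    soft = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
    hand = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print("InfoNCE loss:", info_nce(soft, hand))
    print("MI lower bound (nats):", mi_lower_bound(soft, hand))
```

Minimizing this loss with respect to the soft-prompt features raises the lower bound, which is one way the two textual views could be pulled toward agreement. In a real setup the features would come from the text encoder of the vision-language model, and the loss would be added to the usual classification objective.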

Abstract (original)

Prompt tuning for vision-language models such as CLIP involves optimizing the text prompts used to generate image-text pairs for specific downstream tasks. While hand-crafted or template-based prompts are generally applicable to a wider range of unseen classes, they tend to perform poorly in downstream tasks (i.e., seen classes). Learnable soft prompts, on the other hand, often perform well in downstream tasks but lack generalizability. Additionally, prior research has predominantly concentrated on the textual modality, with very few studies attempting to explore the prompt’s generalization potential from the visual modality. Keeping these limitations in mind, we investigate how to prompt tuning to obtain both a competitive downstream performance and generalization. The study shows that by treating soft and hand-crafted prompts as dual views of the textual modality, and maximizing their mutual information, we can better ensemble task-specific and general semantic information. Moreover, to generate more expressive prompts, the study introduces a class-wise augmentation from the visual modality, resulting in significant robustness to a wider range of unseen classes. Extensive evaluations on several benchmarks report that the proposed approach achieves competitive results in terms of both task-specific performance and general abilities.

arXiv information

Author	Qian Zhang
Published	2024-10-23 16:22:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
