Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models

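Summary

Pre-trained vision-language models such as CLIP have shown promising zero-shot generalization on many downstream tasks when given properly designed text prompts.
Rather than relying on hand-engineered prompts, recent work learns prompts from the training data of downstream tasks.
Although effective, training on domain-specific data reduces a model's ability to generalize to unseen domains.
This work proposes test-time prompt tuning (TPT), a method that learns adaptive prompts on the fly from a single test sample.
For image classification, TPT optimizes the prompt by minimizing entropy with confidence selection, so that the model makes consistent predictions across different augmented views of each test sample.
When evaluated on generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data.
When evaluated on cross-dataset generalization with unseen categories, TPT performs on par with state-of-the-art approaches that use additional training data.
Project page: https://azshue.github.io/TPT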

Abstract (original)

Pre-trained vision-language models (e.g., CLIP) have shown promising zero-shot generalization in many downstream tasks with properly designed text prompts. Instead of relying on hand-engineered prompts, recent works learn prompts using the training data from downstream tasks. While effective, training on domain-specific data reduces a model’s generalization capability to unseen new domains. In this work, we propose test-time prompt tuning (TPT), a method that can learn adaptive prompts on the fly with a single test sample. For image classification, TPT optimizes the prompt by minimizing the entropy with confidence selection so that the model has consistent predictions across different augmented views of each test sample. In evaluating generalization to natural distribution shifts, TPT improves the zero-shot top-1 accuracy of CLIP by 3.6% on average, surpassing previous prompt tuning approaches that require additional task-specific training data. In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data. Project page: https://azshue.github.io/TPT.
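The objective described above, entropy minimization with confidence selection over augmented views, can be illustrated with a small sketch. This is only one plausible reading of the abstract, not the authors' implementation: it assumes that "confidence" is measured by the entropy of each view's prediction, that the loss is the entropy of the prediction averaged over the retained views, and that a fixed fraction of views is kept. The function names, the `keep_fraction` parameter, and the probability values are all hypothetical. The sketch computes the loss only; the gradient step that would update the prompt is omitted.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_confident(view_probs, keep_fraction):
    """Keep the lowest-entropy (most confident) views.

    keep_fraction is an assumed hyperparameter, not taken from the abstract.
    """
    n_keep = max(1, int(len(view_probs) * keep_fraction))
    return sorted(view_probs, key=entropy)[:n_keep]


def marginal_entropy_loss(view_probs, keep_fraction=0.5):
    """Entropy of the prediction averaged over the confident views."""
    kept = select_confident(view_probs, keep_fraction)
    n_classes = len(kept[0])
    avg = [sum(v[c] for v in kept) / len(kept) for c in range(n_classes)]
    return entropy(avg)


# Hypothetical class probabilities for four augmented views of one test image.
views = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.34, 0.33, 0.33],  # near-uniform, so it is filtered out as unconfident
    [0.20, 0.60, 0.20],
]

loss = marginal_entropy_loss(views, keep_fraction=0.5)
print(f"loss over confident views: {loss:.4f}")
```

In a full test-time tuning loop, the per-view probabilities would come from the vision-language model evaluated with the current prompt, and this loss would be minimized with respect to the prompt parameters for the single test sample. Discarding high-entropy views keeps uninformative augmentations (for example, crops that miss the object) from dominating the averaged prediction.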

arXiv information

Authors	Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, Chaowei Xiao
Published	2022-09-15 17:55:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
