Prompt Vision Transformer for Domain Generalization

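Abstract

Vision transformers (ViTs) have shown impressive representation-learning ability, yet we find empirically that they do not generalize well to unseen domains when trained with previous domain generalization algorithms.
This paper proposes DoPrompt, a new approach based on prompt learning that embeds source-domain knowledge in domain prompts and uses it to make predictions on the target domain.
Specifically, each domain prompt is prepended to the ViT input tokens of images from its corresponding source domain.
Because each domain prompt is optimized for only one domain, it learns domain-specific knowledge efficiently.
Meanwhile, a prompt adapter is trained to produce a suitable prompt for each input image from the learned source-domain prompts.
At test time, the adapted prompt generated by the prompt adapter exploits the similarity between the features of an out-of-domain image and the source domains to integrate source-domain knowledge appropriately.
Extensive experiments are conducted on four benchmark datasets.
Our approach improves averaged accuracy by 1.4%, which is 3.5 times the improvement achieved by the state-of-the-art algorithm with a ViT backbone.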
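
The prompt mechanism described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the abstract does not specify how the prompt adapter is built, so the sketch assumes a fixed dot-product similarity against a hypothetical per-domain key vector, followed by a softmax that weights the domain prompts. The names `adapter_scores`, `adapt_prompt`, `prepend_prompt`, and `domain_keys` are invented for illustration, and the ViT itself is left out.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adapter_scores(image_feature, domain_keys):
    """ASSUMPTION: similarity is a dot product between the image feature
    and one hypothetical key vector per source domain. In the paper the
    adapter is trained; here it is a fixed scorer."""
    return [sum(f * k for f, k in zip(image_feature, key)) for key in domain_keys]


def adapt_prompt(domain_prompts, weights):
    """Weighted sum of the domain prompts.
    Each prompt is a list of prompt tokens; each token is a list of floats."""
    num_tokens = len(domain_prompts[0])
    dim = len(domain_prompts[0][0])
    adapted = [[0.0] * dim for _ in range(num_tokens)]
    for w, prompt in zip(weights, domain_prompts):
        for i in range(num_tokens):
            for j in range(dim):
                adapted[i][j] += w * prompt[i][j]
    return adapted


def prepend_prompt(prompt, tokens):
    """Prepend prompt tokens to the image tokens fed to the ViT."""
    return prompt + tokens


# Toy setup: 2 source domains, 1 prompt token each, embedding size 2.
domain_prompts = [[[1.0, 0.0]], [[0.0, 1.0]]]
domain_keys = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical, see adapter_scores
image_feature = [2.0, 0.0]               # closer to domain 0
image_tokens = [[0.5, 0.5], [0.3, 0.7]]  # stand-in for ViT patch tokens

weights = softmax(adapter_scores(image_feature, domain_keys))
adapted = adapt_prompt(domain_prompts, weights)
sequence = prepend_prompt(adapted, image_tokens)
```

In this toy setup the image feature is more similar to the first domain's key, so the adapted prompt leans toward the first domain prompt while still mixing in the second. During training on a source-domain image, the sequence would instead be built with that domain's own prompt, e.g. `prepend_prompt(domain_prompts[0], image_tokens)`.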

Abstract (original)

Though vision transformers (ViTs) have exhibited impressive ability for representation learning, we empirically find that they cannot generalize well to unseen domains with previous domain generalization algorithms. In this paper, we propose a novel approach DoPrompt based on prompt learning to embed the knowledge of source domains in domain prompts for target domain prediction. Specifically, domain prompts are prepended before ViT input tokens from the corresponding source domain. Each domain prompt learns domain-specific knowledge efficiently since it is optimized only for one domain. Meanwhile, we train a prompt adapter to produce a suitable prompt for each input image based on the learned source domain prompts. At test time, the adapted prompt generated by the prompt adapter can exploit the similarity between the feature of the out-of-domain image and source domains to properly integrate the source domain knowledge. Extensive experiments are conducted on four benchmark datasets. Our approach achieves 1.4% improvements in the averaged accuracy, which is 3.5 times the improvement of the state-of-the-art algorithm with a ViT backbone.

arXiv information

Authors	Zangwei Zheng, Xiangyu Yue, Kai Wang, Yang You
Published	2022-08-18 15:34:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
