Understanding and Improving Visual Prompting: A Label-Mapping Perspective

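Summary

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks.
VP reprograms a fixed, pre-trained source model to perform downstream tasks in a target domain simply by adding a universal prompt (an input perturbation pattern) to the downstream data points.
However, it remains unclear why VP stays effective even when the label mapping (LM) between source classes and target classes follows no rule.
Motivated by this, we ask: how is LM related to VP?
And how can that relationship be exploited to improve accuracy on the target task?
We examine the influence of LM on VP in detail and give an affirmative answer: a better "quality" of LM (assessed by mapping precision and explanation) consistently improves the effectiveness of VP.
This contrasts with prior work, in which the LM factor was missing.
To optimize LM, we propose a new VP framework called ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps source labels to target labels and progressively improves the target task accuracy of VP.
In addition, when a contrastive language-image pretrained (CLIP) model is used, we propose integrating an LM process to assist CLIP's text prompt selection and improve target task accuracy.
Extensive experiments show that our proposal significantly outperforms state-of-the-art VP methods.
When an ImageNet-pretrained ResNet-18 is reprogrammed to 13 target tasks, our method outperforms the baselines by a substantial margin, for example 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets.
Furthermore, our proposal for CLIP-based VP yields accuracy improvements of 13.7% on Flowers102 and 7.1% on DTD.
The code is available at https://github.com/OPTML-Group/ILM-VP.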

Abstract (original)

We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in terms of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even given a ruleless label mapping (LM) between the source classes and the target classes. Inspired by the above, we ask: How is LM interrelated with VP? And how to exploit such a relationship to improve its accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer that a better ‘quality’ of LM (assessed by mapping precision and explanation) can consistently improve the effectiveness of VP. This is in contrast to the prior art where the factor of LM was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. As highlighted below, we show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets. Besides, our proposal on CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD respectively. Our code is available at https://github.com/OPTML-Group/ILM-VP.
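
The label re-mapping step mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the mapping rule, so the code below assumes a greedy, frequency-based one-to-one assignment: each target class is paired with the source class that the frozen model most often predicts for its (prompted) samples. The function name `greedy_label_mapping` and its arguments are hypothetical and are not taken from the authors' repository; in an iterative scheme such a mapping would presumably be recomputed as the prompt is updated.

```python
import numpy as np


def greedy_label_mapping(source_preds, target_labels, num_source, num_target):
    """Hypothetical helper: map each target class to a distinct source class.

    source_preds[i]  : source-class index predicted for sample i
    target_labels[i] : ground-truth target-class index of sample i
    Assumes num_source >= num_target so a one-to-one mapping exists.
    """
    source_preds = np.asarray(source_preds)
    target_labels = np.asarray(target_labels)

    # counts[s, t] = number of target-class-t samples predicted as source class s
    counts = np.zeros((num_source, num_target), dtype=np.float64)
    np.add.at(counts, (source_preds, target_labels), 1.0)

    mapping = np.full(num_target, -1, dtype=np.int64)
    for _ in range(num_target):
        # Pick the most frequent remaining (source, target) pair.
        s, t = np.unravel_index(np.argmax(counts), counts.shape)
        mapping[t] = s
        counts[s, :] = -1.0  # this source class is now taken
        counts[:, t] = -1.0  # this target class is now assigned
    return mapping


# Toy usage: 4 source classes, 2 target classes.
preds = [0, 0, 1, 2, 2, 2]
labels = [0, 0, 0, 1, 1, 1]
print(greedy_label_mapping(preds, labels, num_source=4, num_target=2))
```

With a mapping of this kind, a source-class prediction is read as the target class assigned to it, which is one simple way to replace an arbitrary (ruleless) label mapping with one driven by the model's own predictions.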

arXiv information

Authors	Aochuan Chen,Yuguang Yao,Pin-Yu Chen,Yihua Zhang,Sijia Liu
Published	2022-11-21 16:49:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
