Multi-Prompt with Depth Partitioned Cross-Modal Learning

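Summary

In recent years, soft prompt learning methods have been proposed for fine-tuning large-scale pre-trained vision-language models on a variety of downstream tasks.
These methods typically feed a combination of learnable text tokens and class tokens into a model whose parameters are kept frozen.
However, they often use a single prompt to describe the context of a class, which fails to adequately capture the diverse attributes of a category.
This work introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts.
The method partitions the depths of the visual encoder and connects learnable prompts to the separated visual depths, so that different prompts can capture the hierarchical contextual depths of visual representations.
Furthermore, to get the most out of multi-prompt learning, it incorporates prior information from manually designed templates and learnable multi-prompts, improving the generalization capability of the approach.
The approach is evaluated on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization.
For instance, the method achieves a harmonic mean of $79.28$ averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), demonstrating significant competitiveness with state-of-the-art prompting methods.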
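
The abstract does not say which two quantities enter the harmonic mean. In the base-to-new evaluation commonly used for prompt learning methods, it is the harmonic mean of base-class accuracy and new-class accuracy, and the sketch below assumes that convention. The function name and the input values are illustrative only and are not results from the paper.

```python
def harmonic_mean(base_acc: float, new_acc: float) -> float:
    """Harmonic mean of two accuracies given in percent.

    Assumption: the two inputs are base-class and new-class accuracy,
    as in the usual base-to-new protocol; the abstract does not confirm this.
    """
    if base_acc <= 0 or new_acc <= 0:
        # The harmonic mean collapses to 0 when either accuracy is 0.
        return 0.0
    return 2 * base_acc * new_acc / (base_acc + new_acc)


# Illustrative numbers only, not taken from the paper.
example = harmonic_mean(80.0, 70.0)
```

The harmonic mean is pulled toward the lower of the two accuracies, so a method cannot score well by trading new-class performance for base-class performance.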

Abstract (original)

In recent years, soft prompt learning methods have been proposed to fine-tune large-scale vision-language pre-trained models for various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input for models with frozen parameters. However, they often employ a single prompt to describe class contexts, failing to capture categories’ diverse attributes adequately. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we incorporate prior information from manually designed templates and learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a $79.28$ harmonic mean, averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), demonstrating significant competitiveness compared to state-of-the-art prompting methods.
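
The idea of dividing the visual encoder depths among several prompts can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state how the depths are divided, so the contiguous, near-equal split, the function name `partition_depths`, and the layer and prompt counts are all assumptions.

```python
from typing import List


def partition_depths(num_layers: int, num_prompts: int) -> List[range]:
    """Split layer indices 0..num_layers-1 into contiguous groups, one per prompt.

    Assumption: a contiguous, near-equal split. The abstract only says the
    visual encoder depths are divided and connected to learnable prompts.
    """
    if not 1 <= num_prompts <= num_layers:
        raise ValueError("num_prompts must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_prompts)
    groups: List[range] = []
    start = 0
    for i in range(num_prompts):
        # The first `extra` groups take one additional layer each.
        size = base + (1 if i < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


# Hypothetical setting: a 12-layer visual encoder shared among 4 prompts.
groups = partition_depths(12, 4)
for prompt_idx, layers in enumerate(groups):
    print(f"prompt {prompt_idx} -> visual layers {list(layers)}")
```

Under this assumed split, each prompt is tied to one block of consecutive layers, so earlier prompts are associated with shallower visual features and later prompts with deeper ones.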

arXiv information

Authors	Yingjie Tian, Yiqi Wang, Xianda Guo, Zheng Zhu, Long Chen
Published	2024-04-30 10:39:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
