Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models

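Abstract

Test-time adaptation lets a model generalize to diverse data using only unlabeled test samples, which makes it valuable in real-world scenarios.
Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability.
However, these methods typically adapt VLMs through a single modality only, and they fail to accumulate task-specific knowledge as more samples are processed.
To address this, we introduce Dual Prototype Evolving (DPE), a new test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multiple modalities.
Specifically, we create and evolve two sets of prototypes, textual and visual, to progressively capture more accurate multi-modal representations of the target classes during test time.
Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities.
Extensive experiments on 15 benchmark datasets show that the proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency.
Code is available at https://github.com/zhangce01/DPE-CLIP.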
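
The idea of keeping a textual and a visual prototype per class and letting them evolve over the test stream can be illustrated with a small toy sketch. This is not the authors' implementation: the class name `DualPrototypes`, the weight `alpha`, the running-mean update rule, and the 2-D feature vectors are all assumptions made for illustration, only the visual prototypes evolve here, and the learnable residuals described in the abstract are omitted entirely.

```python
import math


def normalize(v):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class DualPrototypes:
    """Toy per-class textual and visual prototypes (hypothetical, for illustration)."""

    def __init__(self, text_prototypes, alpha=1.0):
        # One text embedding per class, e.g. produced by a text encoder.
        self.text = [normalize(p) for p in text_prototypes]
        # Visual prototypes start empty and are built from test samples.
        self.visual = [None] * len(self.text)
        self.counts = [0] * len(self.text)
        # Assumed weight of the visual similarity term.
        self.alpha = alpha

    def scores(self, feature):
        """Cosine similarity to the text prototype plus, if present, the visual one."""
        f = normalize(feature)
        out = []
        for t, v in zip(self.text, self.visual):
            s = dot(f, t)
            if v is not None:
                s += self.alpha * dot(f, normalize(v))
            out.append(s)
        return out

    def predict_and_update(self, feature):
        """Pseudo-label the sample, then fold it into that class's visual prototype."""
        s = self.scores(feature)
        label = max(range(len(s)), key=s.__getitem__)
        f = normalize(feature)
        n = self.counts[label]
        if self.visual[label] is None:
            self.visual[label] = f
        else:
            # Running mean over all samples assigned to this class so far.
            self.visual[label] = [
                (n * old + new) / (n + 1)
                for old, new in zip(self.visual[label], f)
            ]
        self.counts[label] = n + 1
        return label


# Two classes with orthogonal text prototypes and a short unlabeled test stream.
protos = DualPrototypes([[1.0, 0.0], [0.0, 1.0]])
stream = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.3]]
labels = [protos.predict_and_update(x) for x in stream]
print(labels, protos.counts)
```

In this sketch, later predictions draw on both the fixed text prototype and whatever visual evidence has accumulated for each class, which is the accumulation behavior the abstract contrasts with single-modality, per-sample methods.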

Abstract (original)

Test-time adaptation, which enables models to generalize to diverse data with unlabeled test samples, holds significant value in real-world scenarios. Recently, researchers have applied this setting to advanced pre-trained vision-language models (VLMs), developing approaches such as test-time prompt tuning to further extend their practical applicability. However, these methods typically focus solely on adapting VLMs from a single modality and fail to accumulate task-specific knowledge as more samples are processed. To address this, we introduce Dual Prototype Evolving (DPE), a novel test-time adaptation approach for VLMs that effectively accumulates task-specific knowledge from multi-modalities. Specifically, we create and evolve two sets of prototypes–textual and visual–to progressively capture more accurate multi-modal representations for target classes during test time. Moreover, to promote consistent multi-modal representations, we introduce and optimize learnable residuals for each test sample to align the prototypes from both modalities. Extensive experimental results on 15 benchmark datasets demonstrate that our proposed DPE consistently outperforms previous state-of-the-art methods while also exhibiting competitive computational efficiency. Code is available at https://github.com/zhangce01/DPE-CLIP.

arXiv information

Authors	Ce Zhang, Simon Stepputtis, Katia Sycara, Yaqi Xie
Published	2024-10-16 17:59:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google