AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

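Abstract

Pre-trained vision-language models (VLMs) have shown impressive results on a wide range of visual classification tasks. When a VLM is adapted to understand new concepts, however, its potential often goes largely untapped because little information is available about the new classes. To address this limitation, we introduce a new adaptation framework, AWT (Augment, Weight, then Transport). AWT has three main components: it augments the inputs with diverse visual perspectives and enriched class descriptions, obtained through image transformations and language models; it weights the inputs dynamically according to prediction entropy; and it uses optimal transport to mine semantic correlations in the vision-language space. AWT integrates seamlessly into a variety of VLMs, strengthens their zero-shot capabilities without additional training, and supports few-shot learning through an integrated multimodal adapter module. We validate AWT in several challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. In each setting AWT consistently outperforms the state-of-the-art methods. Our extensive studies further demonstrate its effectiveness and adaptability across different VLMs, architectures, and scales.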

Abstract (original)

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT’s effectiveness and adaptability across different VLMs, architectures, and scales.
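The three steps named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the softmax-over-negative-entropy weighting, the cosine-distance cost, the Sinkhorn solver, and the uniform weights on the class descriptions are all assumptions chosen for illustration. Feature extraction is omitted; the inputs stand in for embeddings of augmented image views and of language-model-generated class descriptions.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def entropy_weights(logits, temperature=1.0):
    """Weight each input by its prediction confidence (assumed scheme).

    logits: (n_inputs, n_classes). Inputs with lower prediction entropy
    receive larger weights; the weights sum to 1.
    """
    p = softmax(logits, axis=1)
    h = -(p * np.log(p + 1e-12)).sum(axis=1)  # entropy per input
    return softmax(-h / temperature, axis=0)


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan with marginals a (rows), b (cols)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def ot_distance(image_feats, text_feats, a, b):
    """Transport cost between weighted image views and weighted descriptions."""
    img = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    cost = 1.0 - img @ txt.T  # cosine distance (assumed cost function)
    plan = sinkhorn(cost, a, b)
    return float((plan * cost).sum()), plan


def classify(view_feats, class_text_feats, view_weights):
    """Return the index of the class with the smallest transport cost."""
    dists = []
    for text_feats in class_text_feats:
        # Assumption: descriptions are weighted uniformly in this sketch.
        b = np.full(len(text_feats), 1.0 / len(text_feats))
        d, _ = ot_distance(view_feats, text_feats, view_weights, b)
        dists.append(d)
    return int(np.argmin(dists))
```

In this sketch, `entropy_weights` would be applied to the per-view class logits to obtain `view_weights`, and `classify` then compares the weighted set of views against each class's set of descriptions as a whole instead of against a single prompt embedding.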

arXiv information

Authors	Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, Limin Wang
Published	2024-07-05 15:52:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
