Bridge the Modality and Capacity Gaps in Vision-Language Model Selection

Summary

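Vision-language models (VLMs) perform well at zero-shot image classification by matching images against textual category names.
As the variety of pre-trained VLMs grows, so does the chance of finding one that suits a specific task.
A promising zero-shot image classification strategy is therefore to select the most appropriate pre-trained VLM from a VLM zoo using only the text data of the target dataset, without access to its images.
This paper analyzes two inherent challenges in assessing a VLM's ability under this language-only VLM selection setting. The first is the "Modality Gap": the disparity between a VLM's embeddings in the two modalities, which makes text a less reliable substitute for images.
The second is the "Capability Gap": the discrepancy between a VLM's overall ranking and its ranking on the target dataset, which hinders predicting a model's dataset-specific performance directly from its general performance.
To mitigate the negative impact of these two gaps, the authors propose VLM Selection With gAp Bridging (SWAB).
SWAB first applies optimal transport to capture the relevance between open-source datasets and the target dataset in the form of a transport matrix.
It then uses this matrix to transfer useful statistics of the VLMs from the open-source datasets to the target dataset, bridging the two gaps and improving the estimate of each VLM's capacity for VLM selection.
Experiments across a range of VLMs and image classification datasets validate SWAB's effectiveness.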
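
The two SWAB steps described in the summary (build a transport matrix between open-source and target classes, then use it to carry statistics across) can be illustrated with a small sketch. This is not the authors' implementation: the Sinkhorn solver, the cosine-distance cost, the uniform marginals, the random "class-name embeddings", and the per-class statistic `src_stat` are all assumptions chosen only to make the idea concrete.

```python
import numpy as np

def sinkhorn(cost, a, b, reg=0.5, n_iters=200):
    """Entropy-regularised optimal transport via Sinkhorn iterations.

    cost: (n, m) cost between source classes and target classes
    a, b: source and target marginals, each summing to 1
    Returns an (n, m) transport matrix with row sums a and column sums ~ b.
    """
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)            # rescale to match column marginals
        u = a / (K @ v)              # rescale to match row marginals
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)

# Hypothetical text embeddings of class names (random stand-ins, unit-normalised).
src = rng.normal(size=(6, 16))
src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(4, 16))
tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)

cost = 1.0 - src @ tgt.T             # cosine distance between class names
a = np.full(6, 1 / 6)                # uniform mass on open-source classes (assumption)
b = np.full(4, 1 / 4)                # uniform mass on target classes (assumption)
plan = sinkhorn(cost, a, b)

# Hypothetical per-class statistic of one VLM on the open-source classes.
src_stat = rng.uniform(0.3, 0.9, size=6)

# Barycentric transfer: each target class receives a weighted average of the
# source statistics, weighted by the mass transported into it.
tgt_stat = (plan.T @ src_stat) / plan.sum(axis=0)
print(tgt_stat)
```

Because each entry of `tgt_stat` is a convex combination of the source values, the transferred estimates always stay within the range of `src_stat`; how the paper actually defines and combines its statistics is not specified in this summary.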

Summary (original)

Vision Language Models (VLMs) excel in zero-shot image classification by pairing images with textual category names. The expanding variety of Pre-Trained VLMs enhances the likelihood of identifying a suitable VLM for specific tasks. Thus, a promising zero-shot image classification strategy is selecting the most appropriate Pre-Trained VLM from the VLM Zoo, relying solely on the text data of the target dataset without access to the dataset’s images. In this paper, we analyze two inherent challenges in assessing the ability of a VLM in this Language-Only VLM selection: the ‘Modality Gap’ — the disparity in VLM’s embeddings across two different modalities, making text a less reliable substitute for images; and the ‘Capability Gap’ — the discrepancy between the VLM’s overall ranking and its ranking for target dataset, hindering direct prediction of a model’s dataset-specific performance from its general performance. We propose VLM Selection With gAp Bridging (SWAB) to mitigate the negative impact of these two gaps. SWAB first adopts optimal transport to capture the relevance between open-source datasets and target dataset with a transportation matrix. It then uses this matrix to transfer useful statistics of VLMs from open-source datasets to the target dataset for bridging those two gaps and enhancing the VLM’s capacity estimation for VLM selection. Experiments across various VLMs and image classification datasets validate SWAB’s effectiveness.
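
To make the abstract's first two ideas concrete, here is a minimal sketch of zero-shot classification by pairing image embeddings with class-name embeddings, together with one simple proxy for the modality gap (the distance between the centroids of the normalised image and text embeddings). The embeddings are synthetic, the function names are hypothetical, and the centroid distance is an assumed measure rather than necessarily the one used in the paper.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_predict(image_emb, text_emb):
    """Assign each image to the class whose text embedding is most cosine-similar."""
    sims = l2_normalize(image_emb) @ l2_normalize(text_emb).T
    return sims.argmax(axis=1)

def modality_gap(image_emb, text_emb):
    """Distance between the centroids of normalised image and text embeddings."""
    diff = l2_normalize(image_emb).mean(axis=0) - l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(diff))

rng = np.random.default_rng(0)

# Synthetic class-name embeddings: three orthogonal prototypes in 8 dimensions.
text_emb = np.eye(3, 8)
labels = np.array([0, 1, 2, 0, 1, 2])

# Synthetic image embeddings: the class prototype, shifted by a constant
# offset that stands in for the modality gap, plus a little noise.
offset = np.zeros(8)
offset[7] = 0.5
image_emb = text_emb[labels] + offset + 0.01 * rng.normal(size=(6, 8))

preds = zero_shot_predict(image_emb, text_emb)
gap = modality_gap(image_emb, text_emb)
print(preds, gap)
```

In this toy setup the offset does not change the predictions, but it does mean that statistics computed from text embeddings alone would sit in a different region of the space than the image embeddings they are meant to stand in for, which is the concern the abstract raises.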

arXiv information

Authors	Chao Yi, De-Chuan Zhan, Han-Jia Ye
Published	2024-03-20 17:54:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
