OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning

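Summary

Transductive zero-shot learning (ZSL) aims to classify unseen categories by using both semantic class descriptions and the distribution of the unlabeled test data.
Vision-Language Models (VLMs) such as CLIP excel at aligning visual inputs with textual semantics, but they often rely too heavily on class-level priors and miss fine-grained visual cues.
In contrast, Vision-only Foundation Models (VFMs) such as DINOv2 provide rich perceptual features but lack semantic alignment.
To exploit the complementary strengths of these models, the authors propose OTFusion, a simple yet effective training-free framework that bridges VLMs and VFMs via Optimal Transport.
Specifically, OTFusion aims to learn a shared probabilistic representation that aligns visual and semantic information by minimizing the transport cost between their respective distributions.
This unified distribution enables coherent class predictions that are both semantically meaningful and visually grounded.
Extensive experiments on 11 benchmark datasets show that OTFusion consistently outperforms the original CLIP model, with an average accuracy improvement of nearly 10%, without any fine-tuning or additional annotations.
The code will be publicly released after the paper is accepted.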

Summary (original)

Transductive zero-shot learning (ZSL) aims to classify unseen categories by leveraging both semantic class descriptions and the distribution of unlabeled test data. While Vision-Language Models (VLMs) such as CLIP excel at aligning visual inputs with textual semantics, they often rely too heavily on class-level priors and fail to capture fine-grained visual cues. In contrast, Vision-only Foundation Models (VFMs) like DINOv2 provide rich perceptual features but lack semantic alignment. To exploit the complementary strengths of these models, we propose OTFusion, a simple yet effective training-free framework that bridges VLMs and VFMs via Optimal Transport. Specifically, OTFusion aims to learn a shared probabilistic representation that aligns visual and semantic information by minimizing the transport cost between their respective distributions. This unified distribution enables coherent class predictions that are both semantically meaningful and visually grounded. Extensive experiments on 11 benchmark datasets demonstrate that OTFusion consistently outperforms the original CLIP model, achieving an average accuracy improvement of nearly $10\%$, all without any fine-tuning or additional annotations. The code will be publicly released after the paper is accepted.
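
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea, not the authors' method. It uses generic entropic optimal transport (Sinkhorn iterations) to turn per-class scores from two models into one balanced soft assignment over the test set. Everything here is an assumption made for illustration: the toy score tables `vlm_probs` and `vfm_probs`, the equal-weight log-average cost, the uniform class marginal, and the function name `sinkhorn`.

```python
import math

def sinkhorn(cost, row_mass, col_mass, eps=0.5, n_iters=200):
    """Entropic-regularised OT plan between two discrete distributions."""
    n, k = len(cost), len(cost[0])
    # Gibbs kernel: low cost means high affinity.
    K = [[math.exp(-cost[i][j] / eps) for j in range(k)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * k
    for _ in range(n_iters):
        # Rescale rows so row i of the plan carries row_mass[i].
        u = [row_mass[i] / sum(K[i][j] * v[j] for j in range(k)) for i in range(n)]
        # Rescale columns so column j of the plan carries col_mass[j].
        v = [col_mass[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(k)]
    return [[u[i] * K[i][j] * v[j] for j in range(k)] for i in range(n)]

# Hypothetical class probabilities for 4 test images and 2 unseen classes.
vlm_probs = [[0.70, 0.30], [0.60, 0.40], [0.55, 0.45], [0.20, 0.80]]  # e.g. text-image similarity
vfm_probs = [[0.90, 0.10], [0.80, 0.20], [0.30, 0.70], [0.10, 0.90]]  # e.g. visual cluster affinity

# Assumed fusion rule: cost is the negative average log-probability of both views.
cost = [[-0.5 * (math.log(p) + math.log(q)) for p, q in zip(pr, qr)]
        for pr, qr in zip(vlm_probs, vfm_probs)]

n, k = len(cost), len(cost[0])
# Transductive assumption: each image has equal mass, classes are balanced.
plan = sinkhorn(cost, [1.0 / n] * n, [1.0 / k] * k)
pred = [max(range(k), key=lambda j: row[j]) for row in plan]
print(pred)  # [0, 0, 1, 1]
```

In this toy data the third image is labeled class 0 by the language-aligned scores alone (0.55 vs 0.45), but the fused, balanced assignment moves it to class 1, which is the kind of correction from visual evidence the abstract describes.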

arXiv information

Authors	Qiyu Xu, Wenyang Chen, Zhanxuan Hu, Huafeng Li, Yonghang Tai
Published	2025-06-16 17:27:47+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
