Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification

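Summary

Few-shot image classification remains a key challenge in computer vision, particularly in data-scarce settings.
Existing methods typically rely on pre-trained vision-language models such as CLIP.
However, because of the modality gap, that is, the inconsistent distribution of image and text features in the joint embedding space, using these features directly as class prototypes often leads to suboptimal performance.
To address this problem, we propose a new Cross-Modal Mapping (CMM) method.
The method globally aligns image features with the text feature space through a linear transformation and optimizes their local spatial relationships with a triplet loss, thereby significantly improving cross-modal consistency.
Experimental results show that, compared with other methods, CMM simplifies the training process and is more efficient.
In addition, CMM improves average Top-1 accuracy by 1.06% on 11 benchmark datasets compared with methods that partially fine-tune the backbone, and it performs well on 4 distribution-shift datasets.
Notably, CMM effectively mitigates the modality gap in pre-trained models, allowing text features to serve as effective class prototypes for image features, and thus offers an efficient and highly generalizable solution for few-shot learning.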
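
The two ingredients named in the summary, a linear mapping of image features into the text feature space and a triplet loss, can be illustrated with a small sketch. This is not the authors' implementation: the bias term, the cosine distance, the margin value, the choice of text features as positives and negatives, and every function name below are assumptions made for illustration, and the toy vectors are made up.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))

def linear_map(W, b, x):
    """Apply a linear mapping W x + b to an image feature (the bias is an assumption)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on cosine distance (distance and margin are assumptions)."""
    d_pos = 1.0 - cosine(anchor, positive)
    d_neg = 1.0 - cosine(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def classify(mapped, text_prototypes):
    """Return the index of the text prototype most similar to the mapped image feature."""
    sims = [cosine(mapped, t) for t in text_prototypes]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy data: two class prototypes standing in for text features.
text_prototypes = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity mapping, standing in for a learned matrix
b = [0.0, 0.0]
image_feature = [0.9, 0.1]    # hypothetical image feature of class 0

mapped = linear_map(W, b, image_feature)
loss = triplet_loss(mapped, text_prototypes[0], text_prototypes[1])
pred = classify(mapped, text_prototypes)
```

In this toy case the mapped feature already lies much closer to its own class prototype than to the other one, so the triplet loss is zero and the nearest text prototype is class 0. In training, W and b would be updated to reduce the loss over the few labeled shots.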

Summary (original)

Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-language models, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets compared to methods that partially fine-tune the backbone, and it performs excellently on 4 distribution shift datasets. Notably, CMM effectively mitigates the modality gap in pre-trained models, enabling text features to serve as effective class prototypes for image features, thus providing an efficient and highly generalizable solution for few-shot learning.

arXiv information

Authors	Xi Yang, Pai Peng, Wulin Xie, Xiaohuan Lu, Jie Wen
Published	2025-04-16 15:07:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
