GMM-Based Comprehensive Feature Extraction and Relative Distance Preservation For Few-Shot Cross-Modal Retrieval

Abstract

Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
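
The abstract names two loss ideas without giving formulas: a multi-positive contrastive objective and a constraint on relative distances between image and text features. The sketch below is one possible reading of each, not the authors' formulation. It assumes a SupCon-style average over positives for the first and a mean squared gap between per-modality pairwise distance matrices for the second. All function names, the `temperature` parameter, and the choice of Euclidean distance are hypothetical, and the GMM fitting step is omitted.

```python
import math


def multi_positive_contrastive_loss(sims, positive_idx, temperature=0.1):
    """Assumed SupCon-style loss for one anchor with several positives.

    sims: similarity of the anchor to every candidate sample.
    positive_idx: indices of candidates that share the anchor's class.
    """
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max before exponentiating for stability
    log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
    # Average the negative log-probability over all positives.
    return -sum(logits[i] - log_denom for i in positive_idx) / len(positive_idx)


def pairwise_distances(points):
    """Euclidean distance matrix for a list of equal-length vectors."""
    return [[math.dist(p, q) for q in points] for p in points]


def relative_distance_loss(image_feats, text_feats):
    """Assumed alignment term: mean squared gap between the two modalities'
    pairwise distance matrices (row i of each list describes the same item)."""
    d_img = pairwise_distances(image_feats)
    d_txt = pairwise_distances(text_feats)
    n = len(d_img)
    total = sum(
        (d_img[i][j] - d_txt[i][j]) ** 2 for i in range(n) for j in range(n)
    )
    return total / (n * n)
```

Under these assumptions, an anchor whose most similar candidates are its positives receives a lower contrastive loss than one whose positives are the least similar candidates. The distance term is zero whenever the two modalities have the same pairwise geometry, even if one feature set is shifted relative to the other.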

arXiv information

Authors	Chengsong Sun, Weiping Li, Xiang Li, Yuankun Liu, Lianlei Shan
Published	2025-05-19 16:25:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
