MSCI: Addressing CLIP’s Inherent Limitations for Compositional Zero-Shot Learning

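Summary

Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations.
Existing studies largely rely on CLIP’s cross-modal alignment capability, but tend to overlook its limitations in capturing fine-grained local features, which stem from its architecture and training paradigm.
To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP’s visual encoder.
Specifically, we design two self-adaptive aggregators: one extracts local information from low-level visual features, and the other integrates global information from high-level visual features.
This information is progressively incorporated into the textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model’s ability to perceive fine-grained local visual information.
In addition, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, and on different elements within the same combination, allowing it to adapt flexibly to diverse scenarios.
Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model.
Data and code are available at https://github.com/ltpwy/MSCI.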

Abstract (original)

Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies largely rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architecture and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP’s visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model’s perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.
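
The pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it is a toy, standard-library-only illustration under several assumptions: each aggregator is modeled as a softmax-weighted sum over layers, each interaction stage as single-head dot-product attention with a residual update, and the dynamic global/local weighting as one fixed scalar `gate`. The stage order (local first, then global) and all function and parameter names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def aggregate_layers(layer_feats, layer_logits):
    # Hypothetical aggregator: fuse several encoder layers into one set of
    # patch features using softmax weights over the layers.
    # layer_feats[l][p] is the feature vector of patch p at layer l.
    weights = softmax(layer_logits)
    n_patches, dim = len(layer_feats[0]), len(layer_feats[0][0])
    return [
        [sum(w * layer[p][d] for w, layer in zip(weights, layer_feats))
         for d in range(dim)]
        for p in range(n_patches)
    ]

def cross_attend(query, patches):
    # Single-head scaled dot-product attention: one text vector (query)
    # attends over the visual patch vectors and returns their weighted mean.
    scale = math.sqrt(len(query))
    attn = softmax([dot(query, p) / scale for p in patches])
    return [sum(a * p[d] for a, p in zip(attn, patches))
            for d in range(len(query))]

def multi_stage_interaction(text_vec, low_feats, high_feats,
                            low_logits, high_logits, gate):
    # Assumed stage 1: inject local context from the low-level layers.
    local = aggregate_layers(low_feats, low_logits)
    local_ctx = cross_attend(text_vec, local)
    stage1 = [t + gate * c for t, c in zip(text_vec, local_ctx)]
    # Assumed stage 2: inject global context from the high-level layers.
    global_feats = aggregate_layers(high_feats, high_logits)
    global_ctx = cross_attend(stage1, global_feats)
    return [t + (1.0 - gate) * c for t, c in zip(stage1, global_ctx)]

# Toy usage: 2 layers per group, 2 patches per layer, 2-dimensional features.
low = [[[1.0, 2.0], [1.0, 2.0]], [[1.0, 2.0], [1.0, 2.0]]]
high = [[[2.0, 0.0], [2.0, 0.0]], [[2.0, 0.0], [2.0, 0.0]]]
out = multi_stage_interaction([1.0, 0.0], low, high, [0.0, 0.0], [0.0, 0.0], 0.5)
```

In a real model the layer weights and the gate would be learned, and the gate would presumably depend on the composition and on whether the text token describes the state or the object, which is what the abstract's "dynamically adjusts" suggests; the fixed scalar here only marks where that choice enters.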

arXiv information

Authors	Yue Wang, Shuai Xu, Xuelin Zhu, Yicong Li
Published	2025-05-15 13:36:42+00:00

Source and services used

arxiv.jp, Google
