DiscoVLA: Discrepancy Reduction in Vision, Language, and Alignment for Parameter-Efficient Video-Text Retrieval

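Summary

Parameter-efficient adaptation of the image-text pretraining model CLIP to video-text retrieval is a prominent area of research.
CLIP focuses on image-level vision-language matching, whereas video-text retrieval requires comprehensive understanding at the video level.
Three key discrepancies emerge in the transfer from image level to video level: vision, language, and alignment.
Existing methods, however, mainly address vision and neglect language and alignment.
This paper proposes Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which mitigates all three discrepancies simultaneously.
Specifically, it introduces Image-Video Features Fusion, which integrates image-level and video-level features to tackle both the vision and the language discrepancies.
In addition, pseudo image captions are generated to learn fine-grained image-level alignment.
To mitigate the alignment discrepancy, the paper proposes Image-to-Video Alignment Distillation, which uses image-level alignment knowledge to enhance video-level alignment.
Extensive experiments demonstrate the superiority of DiscoVLA.
In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1.
The code is available at https://github.com/LunarShen/DsicoVLA.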

Summary (original)

The parameter-efficient adaptation of the image-text pretraining model CLIP for video-text retrieval is a prominent area of research. While CLIP is focused on image-level vision-language matching, video-text retrieval demands comprehensive understanding at the video level. Three key discrepancies emerge in the transfer from image-level to video-level: vision, language, and alignment. However, existing methods mainly focus on vision while neglecting language and alignment. In this paper, we propose Discrepancy Reduction in Vision, Language, and Alignment (DiscoVLA), which simultaneously mitigates all three discrepancies. Specifically, we introduce Image-Video Features Fusion to integrate image-level and video-level features, effectively tackling both vision and language discrepancies. Additionally, we generate pseudo image captions to learn fine-grained image-level alignment. To mitigate alignment discrepancies, we propose Image-to-Video Alignment Distillation, which leverages image-level alignment knowledge to enhance video-level alignment. Extensive experiments demonstrate the superiority of our DiscoVLA. In particular, on MSRVTT with CLIP (ViT-B/16), DiscoVLA outperforms previous methods by 1.5% in R@1, reaching a final score of 50.5% R@1. The code is available at https://github.com/LunarShen/DsicoVLA.
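The abstract reports results as R@1 (Recall@1), the standard retrieval metric giving the share of text queries whose correct video is ranked first. The following is a minimal sketch of how Recall@K is commonly computed, not the authors' evaluation code. The function name `recall_at_k`, the score-matrix layout (row i is text query i, column j is video j, with the correct match on the diagonal), and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of text queries whose matching video ranks in the top k.

    Assumption (hypothetical layout): similarity[i][j] scores text query i
    against video j, and the correct video for query i is video i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity must contain at least one query")
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: number of videos scored strictly higher
        # than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy scores, invented purely for illustration.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(toy_scores, k=1)
r2 = recall_at_k(toy_scores, k=2)
```

On these toy scores, two of the three queries rank their correct video first, so `r1` is 2/3, and all three place it within the top two, so `r2` is 1.0. A reported 50.5% R@1 therefore means that about half of the test queries retrieve the correct video at rank one.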

arXiv information

Authors	Leqi Shen,Guoqiang Gong,Tianxiang Hao,Tao He,Yifeng Zhang,Pengzhang Liu,Sicheng Zhao,Jungong Han,Guiguang Ding
Published	2025-06-10 15:16:40+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
