Cross-Modal Adapter for Text-Video Retrieval

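Summary

Text-video retrieval is an important multi-modal learning task whose goal is to retrieve the video most relevant to a given text query.
Pre-trained models such as CLIP have recently shown great potential on this task.
However, as pre-trained models scale up, fully fine-tuning them on text-video retrieval datasets carries a high risk of overfitting.
In practice, it is also costly to train and store a large model for each task.
To overcome these issues, we present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterized layers.
There are two notable differences, however.
First, our method is designed for the multi-modal domain.
Second, it allows early cross-modal interaction between CLIP's two encoders.
Although surprisingly simple, our approach has three notable benefits: (1) it reduces the number of fine-tuned parameters by $\textbf{99.6}\%$, which alleviates overfitting; (2) it saves approximately 30% of training time; and (3) it allows all pre-trained parameters to be kept fixed, so the pre-trained model can be shared across datasets.
Extensive experiments show that, without bells and whistles, it achieves superior or comparable performance to fully fine-tuned methods on the MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
The code is available at \url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.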
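
The abstract does not spell out the adapter's internals, so the following is only a minimal sketch of a generic residual bottleneck adapter of the kind used by adapter-based methods, written in plain Python. The class name `BottleneckAdapter`, the toy sizes, the zero initialisation, and in particular the idea of sharing one up-projection between the text and video adapters are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Generic residual bottleneck adapter: x + up(relu(down(x)))."""

    def __init__(self, dim, bottleneck, rng, shared_up=None):
        # Down-projection with small random values, shape (bottleneck, dim).
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection, shape (dim, bottleneck). Zero initialisation makes
        # the adapter an identity map before any training. Passing
        # `shared_up` reuses one matrix in several adapters, which is an
        # assumed (hypothetical) way to couple the two modalities.
        if shared_up is None:
            shared_up = [[0.0] * bottleneck for _ in range(dim)]
        self.up = shared_up

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]  # ReLU
        delta = matvec(self.up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection

    def num_params(self):
        return sum(len(r) for r in self.down) + sum(len(r) for r in self.up)


rng = random.Random(0)
dim, bottleneck = 8, 2  # toy sizes, far smaller than a real encoder

# Hypothetical coupling: both adapters hold the same up-projection object.
shared_up = [[0.0] * bottleneck for _ in range(dim)]
text_adapter = BottleneckAdapter(dim, bottleneck, rng, shared_up)
video_adapter = BottleneckAdapter(dim, bottleneck, rng, shared_up)

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
print(text_adapter(x) == x)       # identity at initialisation
print(text_adapter.num_params())  # down plus up weights only
```

In this sketch only the adapter weights would be trained, while the encoder that produces `x` stays frozen. That is the property that lets one pre-trained model be reused across datasets.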

Abstract (original)

Text-video retrieval is an important multi-modal learning task, where the goal is to retrieve the most relevant video for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on this task. However, as pre-trained models are scaling up, fully fine-tuning them on text-video retrieval datasets has a high risk of overfitting. Moreover, in practice, it would be costly to train and store a large model for each task. To overcome the above issues, we present a novel $\textbf{Cross-Modal Adapter}$ for parameter-efficient fine-tuning. Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers. However, there are two notable differences. First, our method is designed for the multi-modal domain. Secondly, it allows early cross-modal interactions between CLIP’s two encoders. Although surprisingly simple, our approach has three notable benefits: (1) reduces $\textbf{99.6}\%$ of fine-tuned parameters, and alleviates the problem of overfitting, (2) saves approximately 30% of training time, and (3) allows all the pre-trained parameters to be fixed, enabling the pre-trained model to be shared across datasets. Extensive experiments demonstrate that, without bells and whistles, it achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets. The code will be available at \url{https://github.com/LeapLabTHU/Cross-Modal-Adapter}.
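
To make the parameter and storage claims concrete, here is a small arithmetic sketch. The parameter counts are hypothetical round numbers chosen so that the reduction comes out at the 99.6% quoted above; they are not the paper's actual counts. The function names are likewise illustrative.

```python
def trainable_reduction_percent(full_finetune_params, adapter_params):
    """Percentage by which the trainable-parameter count shrinks."""
    if full_finetune_params <= 0:
        raise ValueError("full_finetune_params must be positive")
    return 100.0 * (1.0 - adapter_params / full_finetune_params)


def stored_params_full(num_datasets, backbone_params):
    """Full fine-tuning keeps one complete model copy per dataset."""
    return num_datasets * backbone_params


def stored_params_adapter(num_datasets, backbone_params, adapter_params):
    """Adapter tuning keeps one frozen backbone plus one adapter per dataset."""
    return backbone_params + num_datasets * adapter_params


# Hypothetical counts, not taken from the paper.
backbone = 100_000_000
adapter = 400_000
datasets = 5  # the abstract lists five benchmarks

reduction = trainable_reduction_percent(backbone, adapter)
print(f"{reduction:.1f}% fewer trainable parameters")
print(stored_params_full(datasets, backbone))
print(stored_params_adapter(datasets, backbone, adapter))
```

Under these assumed numbers, storing five fully fine-tuned models costs five backbones, whereas the adapter setup costs one backbone plus five small adapters.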

arXiv information

Authors	Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Jiwen Lu, Jie Zhou, Shiji Song, Gao Huang
Published	2022-11-17 16:15:30+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
