HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training

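Abstract

Video-language pre-training has improved performance on a wide range of downstream video-language tasks.
However, most previous methods directly inherit or adapt typical image-language pre-training paradigms, and so do not fully exploit the characteristic that is unique to video: its temporal structure.
This paper proposes HiTeA, a Hierarchical Temporal-Aware video-language pre-training framework with two new pre-training tasks that model cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs.
Specifically, a cross-modal moment exploration task explores the moments in a video, which yields detailed video moment representations.
In addition, a multi-modal temporal relation exploration task captures the inherent temporal relations by aligning video-text pairs as a whole at different time resolutions.
The paper also introduces a shuffling test to evaluate how strongly datasets and video-language pre-training models rely on temporal information.
HiTeA achieves state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporally oriented datasets such as SSv2-Template and SSv2-Label, where it improves by 8.6% and 11.1% respectively.
HiTeA also shows strong generalization ability when transferred directly to downstream tasks in a zero-shot manner.
Models and a demo will be available on ModelScope.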

Abstract (original)

Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporal. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for modeling cross-modal alignment between moments and texts as well as the temporal relations of video-text pairs. Specifically, we propose a cross-modal moment exploration task to explore moments in videos, which results in detailed video moment representation. Besides, the inherent temporal relations are captured by aligning video-text pairs as a whole in different time resolutions with multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. Models and demo will be available on ModelScope.
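The abstract names a shuffling test but does not spell out its protocol. The sketch below shows one plausible reading, under stated assumptions: frame order is randomly permuted at evaluation time, and the drop in accuracy relative to the ordered input is taken as a measure of temporal reliance. The function names (`shuffle_frames`, `shuffling_gap`), the integer stand-in for frames, and the two toy predictors are hypothetical and are not taken from the paper.

```python
import random
from typing import Callable, List, Sequence

Frame = int  # stand-in for a decoded frame; a real pipeline would use image tensors


def shuffle_frames(frames: Sequence[Frame], rng: random.Random) -> List[Frame]:
    """Return a copy of the clip with its frames in random order."""
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled


def shuffling_gap(
    predict: Callable[[Sequence[Frame]], int],
    clips: Sequence[Sequence[Frame]],
    labels: Sequence[int],
    seed: int = 0,
) -> float:
    """Accuracy on ordered clips minus accuracy on frame-shuffled clips."""
    rng = random.Random(seed)
    ordered_hits = sum(predict(c) == y for c, y in zip(clips, labels))
    shuffled_hits = sum(
        predict(shuffle_frames(c, rng)) == y for c, y in zip(clips, labels)
    )
    return (ordered_hits - shuffled_hits) / len(clips)


# Hypothetical toy predictors, used only to illustrate the measurement.
def order_sensitive(frames: Sequence[Frame]) -> int:
    """Predicts 1 if the clip ends higher than it starts (depends on order)."""
    return int(frames[-1] > frames[0])


def order_invariant(frames: Sequence[Frame]) -> int:
    """Predicts from the frame sum, which ignores order entirely."""
    return int(sum(frames) > 10)


# Toy data: rising clips are labeled 1, falling clips are labeled 0.
rising = list(range(8))
falling = list(range(7, -1, -1))
clips = [rising, falling] * 25
labels = [1, 0] * 25

gap_sensitive = shuffling_gap(order_sensitive, clips, labels)
gap_invariant = shuffling_gap(order_invariant, clips, labels)
print(f"order-sensitive gap: {gap_sensitive:.2f}")
print(f"order-invariant gap: {gap_invariant:.2f}")
```

In this toy setup the order-invariant predictor cannot be affected by shuffling, so its gap is zero, while any gap for the order-sensitive predictor comes purely from the lost frame order. A dataset on which even a strong model shows almost no gap would, under this reading, carry little temporal signal.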

arXiv information

Authors	Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang
Published	2022-12-30 04:27:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
