HierVL: Learning Hierarchical Video-Language Embeddings

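Summary

Video-language embeddings are a promising way to give semantic meaning to visual representations, but existing methods capture only short-term associations between video clips a few seconds long and their accompanying text. We propose HierVL, a new hierarchical video-language embedding method that accounts for long-term and short-term associations at the same time. The training data consist of videos that come with timestamped text descriptions of human actions and a high-level text summary of the activity across the long video, as available in Ego4D. We introduce a hierarchical contrastive learning objective that encourages text-visual alignment at both the clip level and the video level. The clip-level constraint uses the step-by-step descriptions to capture what is happening at that moment, whereas the video-level constraint uses the summary text to capture why it is happening, that is, the broader context of the activity and the intent of the actor. This hierarchical scheme yields a clip representation that outperforms its single-level counterpart and a long-term video representation that achieves state-of-the-art (SotA) results on tasks requiring long-term video modeling. HierVL transfers successfully to multiple challenging downstream tasks (on EPIC-KITCHENS-100, Charades-Ego, and HowTo100M) in both zero-shot and fine-tuned settings.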

Abstract (original)

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
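
The two-level objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss form or how clip features are aggregated into a video feature, so the symmetric InfoNCE loss, the mean pooling over clips, the temperature value, and the names `info_nce`, `mean_pool`, `hierarchical_loss`, and `video_weight` are all assumptions chosen for illustration. Embeddings are plain Python lists so the example runs without third-party libraries.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # L2-normalize so that dot products are cosine similarities.
    # Assumes v is not the zero vector.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed loss form, not taken from the source).

    visual[i] and text[i] form the positive pair; every other
    combination in the batch is treated as a negative.
    """
    v = [normalize(x) for x in visual]
    t = [normalize(x) for x in text]
    n = len(v)
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean of -log softmax(row)[i], using a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    sim_t = [list(col) for col in zip(*sim)]  # text-to-visual direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


def mean_pool(vectors):
    # Assumed aggregator: average the clip embeddings of one video.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hierarchical_loss(clip_feats, narration_feats, summary_feats, video_weight=1.0):
    """Clip-level plus video-level contrastive loss.

    clip_feats[k][i]      : embedding of clip i in long video k
    narration_feats[k][i] : embedding of the timestamped description of that clip
    summary_feats[k]      : embedding of the high-level summary of video k
    """
    # Clip level: align each clip with its own step-by-step description.
    flat_clips = [c for video in clip_feats for c in video]
    flat_narrations = [n for video in narration_feats for n in video]
    clip_loss = info_nce(flat_clips, flat_narrations)

    # Video level: align the pooled clips of a video with its summary text.
    video_embs = [mean_pool(video) for video in clip_feats]
    video_loss = info_nce(video_embs, summary_feats)

    return clip_loss + video_weight * video_loss
```

With toy embeddings, the loss is small when clips match their descriptions and pooled videos match their summaries, and it grows when the summaries are swapped between videos. In a real system the embeddings would come from trained video and text encoders, and the aggregation step would likely be learned rather than a plain average.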

arXiv information

Authors	Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Published	2023-01-05 21:53:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
