HierVL: Learning Hierarchical Video-Language Embeddings

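Abstract

Video-language embeddings are a promising way to inject semantics into visual representations, but existing methods capture only short-term associations between video clips a few seconds long and their accompanying text.
We propose HierVL, a novel hierarchical video-language embedding that accounts for long-term and short-term associations simultaneously.
As training data, we use videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity across the whole long video (as available in Ego4D).
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and the video level.
The clip-level constraints use the step-by-step descriptions to capture what is happening at that instant, while the video-level constraints use the summary text to capture why it is happening, i.e., the broader context of the activity and the intent of the actor.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling.
HierVL transfers successfully to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, and HowTo100M) in both zero-shot and fine-tuned settings.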

Abstract (original)

Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings.
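
The two-level objective described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract does not specify the loss form or how clips are aggregated into a video representation, so this sketch swaps in a generic symmetric InfoNCE loss and simple mean pooling over clip embeddings. The names `info_nce`, `hierarchical_loss`, `temperature`, and `video_weight` are hypothetical and are not taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(visual, text, temperature=0.07):
    # Generic symmetric InfoNCE (an assumption, not the paper's exact loss):
    # row i of `visual` is the positive for row i of `text`,
    # and every other row in the batch acts as a negative.
    v = l2_normalize(visual)
    t = l2_normalize(text)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    visual_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_visual = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (visual_to_text + text_to_visual)

def hierarchical_loss(clip_vis, clip_txt, summary_txt, video_weight=1.0):
    # clip_vis, clip_txt: (num_videos, clips_per_video, dim)
    # summary_txt:        (num_videos, dim)
    n, c, d = clip_vis.shape
    # Clip level: align each clip with its own narration ("what is happening").
    clip_loss = info_nce(clip_vis.reshape(n * c, d), clip_txt.reshape(n * c, d))
    # Video level: mean-pool the clips (assumed aggregator) and align the
    # pooled vector with the summary text ("why it is happening").
    video_vis = clip_vis.mean(axis=1)
    video_loss = info_nce(video_vis, summary_txt)
    return clip_loss + video_weight * video_loss, clip_loss, video_loss

# Toy usage with random stand-ins for encoder outputs.
rng = np.random.default_rng(0)
clip_vis = rng.normal(size=(4, 3, 16))
clip_txt = clip_vis + 0.01 * rng.normal(size=clip_vis.shape)
summary_txt = clip_vis.mean(axis=1)
total, clip_loss, video_loss = hierarchical_loss(clip_vis, clip_txt, summary_txt)
```

In this toy setup the text embeddings are near copies of the visual ones, so both terms should be small; shuffling the pairing should raise the loss. A real system would replace the random arrays with outputs of trained video and text encoders.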

arXiv information

Authors	Kumar Ashutosh, Rohit Girdhar, Lorenzo Torresani, Kristen Grauman
Published	2023-06-08 14:29:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
