FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

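Summary

This paper presents a self-supervised approach for learning semantic video representations.
Recent vision research shows that masking strategies for vision and natural language supervision have contributed to the development of transferable visual pretraining.
Our goal is to achieve more semantic video representations by leveraging text related to the video content during pretraining, in a fully self-supervised manner.
To this end, we introduce FILS, a novel self-supervised method for video Feature prediction In semantic Language Space.
By correctly predicting the semantics of masked features in language space, the vision model can capture valuable structured information.
The model is learned with a patch-wise video-text contrastive strategy, in which text representations act as prototypes for transforming vision features into a language space; the transformed features are then used as targets for semantically meaningful feature prediction with a masked encoder-decoder structure.
FILS demonstrates remarkable transferability on downstream action recognition tasks and, using ViT-Base, achieves state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA.
The method is efficient, requiring less computation and smaller batches than previous work.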

Summary (original)

This paper demonstrates a self-supervised approach for learning semantic video representations. Recent vision studies show that a masking strategy for vision and natural language supervision has contributed to developing transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during the pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised video Feature prediction In semantic Language Space (FILS). The vision model can capture valuable structured information by correctly predicting masked feature semantics in language space. It is learned using a patch-wise video-text contrastive strategy, in which the text representations act as prototypes for transforming vision features into a language space, which are then used as targets for semantically meaningful feature prediction using our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art on challenging egocentric datasets, like Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches compared to previous works.
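
The abstract gives no implementation details, so the following is only a rough sketch of the idea it describes: patch features are expressed relative to text "prototypes", and a predictor is scored on the masked patches alone. Everything concrete here is an assumption made for illustration and not the authors' method: the function names `to_language_space` and `masked_prediction_loss`, the use of cosine similarity with a softmax and a `temperature` of 0.1, and the squared-error loss.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def to_language_space(patch_features, text_prototypes, temperature=0.1):
    """Map each patch feature to a softmax distribution over text prototypes.

    Assumption: "language space" is modelled as similarity to the prototypes.
    """
    projected = []
    for patch in patch_features:
        logits = [cosine(patch, proto) / temperature for proto in text_prototypes]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        projected.append([e / total for e in exps])
    return projected


def masked_prediction_loss(predictions, targets, mask):
    """Mean squared error over masked patch positions only (assumed loss)."""
    errors = []
    for pred, target, is_masked in zip(predictions, targets, mask):
        if is_masked:
            errors.append(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))
    return sum(errors) / len(errors) if errors else 0.0


# Toy data: three 2-D patch features and two text prototypes (hypothetical values).
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
targets = to_language_space(patches, prototypes)
mask = [False, True, True]  # the last two patches are treated as masked
loss_for_perfect_predictor = masked_prediction_loss(targets, targets, mask)
```

In this toy setup the targets come from the full, unmasked patches, while a real predictor would see only the visible ones; unmasked positions contribute nothing to the loss.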

arXiv information

Authors	Mona Ahmadian, Frank Guerin, Andrew Gilbert
Published	2024-06-05 16:44:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
