Cross-modal Manifold Cutmix for Self-supervised Video Representation Learning

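Summary

This paper addresses the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications.
We present Cross-Modal Manifold Cutmix (CMMC), a new video augmentation technique for self-supervised learning that generates augmented samples by combining different modalities within a video.
By embedding one video tesseract into another across two modalities in feature space, our method improves the quality of the learned video representations.
We run extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval.
The approach is also shown to be effective on the NTU dataset, where domain knowledge is limited.
CMMC achieves performance comparable to other self-supervised methods on both downstream tasks while using less training data.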

Summary (original)

In this paper, we address the challenge of obtaining large-scale unlabelled video datasets for contrastive representation learning in real-world applications. We present a novel video augmentation technique for self-supervised learning, called Cross-Modal Manifold Cutmix (CMMC), which generates augmented samples by combining different modalities in videos. By embedding a video tesseract into another across two modalities in the feature space, our method enhances the quality of learned video representations. We perform extensive experiments on two small-scale video datasets, UCF101 and HMDB51, for action recognition and video retrieval tasks. Our approach is also shown to be effective on the NTU dataset with limited domain knowledge. Our CMMC achieves comparable performance to other self-supervised methods while using less training data for both downstream tasks.
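
The abstract describes the augmentation only at a high level: a video tesseract (a spatio-temporal block) from one modality is embedded into the features of another. The NumPy sketch below illustrates that general idea and is not the authors' implementation. The function names, the `(C, T, H, W)` feature layout, the uniform box sampling, and the mixing ratio `lam` are all assumptions made for illustration; the abstract does not specify which modalities, layers, or sampling scheme CMMC uses.

```python
import numpy as np


def sample_tesseract(shape, lam, rng):
    """Pick a random (T, H, W) box covering roughly (1 - lam) of the volume.

    Hypothetical helper: the box is scaled equally along each axis and
    placed uniformly at random so that it stays inside the volume.
    """
    ratio = (1.0 - lam) ** (1.0 / 3.0)
    size = [max(1, int(round(dim * ratio))) for dim in shape]
    start = [int(rng.integers(0, dim - s + 1)) for dim, s in zip(shape, size)]
    return tuple(slice(a, a + s) for a, s in zip(start, size))


def cross_modal_manifold_cutmix(feat_a, feat_b, lam, rng):
    """Paste a spatio-temporal box of feat_b into a copy of feat_a.

    feat_a, feat_b: feature maps of shape (C, T, H, W), assumed to come
    from two modality-specific encoders at the same depth.
    Returns the mixed features and the fraction of feat_a that is kept.
    """
    if feat_a.shape != feat_b.shape:
        raise ValueError("feature maps must have the same shape")
    box = sample_tesseract(feat_a.shape[1:], lam, rng)
    index = (slice(None),) + box  # keep all channels, cut only in T, H, W
    mixed = feat_a.copy()
    mixed[index] = feat_b[index]
    box_volume = np.prod([s.stop - s.start for s in box])
    kept = 1.0 - box_volume / np.prod(feat_a.shape[1:])
    return mixed, float(kept)


# Usage with random stand-in features for two modalities.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(8, 4, 7, 7))
feat_b = rng.normal(size=(8, 4, 7, 7))
mixed, kept = cross_modal_manifold_cutmix(feat_a, feat_b, lam=0.7, rng=rng)
```

Because the box is rounded to whole feature-map cells, the returned `kept` fraction is the exact share of `feat_a` that survives and will generally differ slightly from the requested `lam`; a contrastive loss that weights the two sources would more safely use `kept`.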

arXiv information

Authors	Srijan Das,Michael S. Ryoo
Published	2023-07-26 14:49:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
