Motion Sensitive Contrastive Learning for Self-supervised Video Representation

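Abstract

Contrastive learning has shown great potential for video representation learning.
However, existing approaches do not fully exploit short-term motion dynamics, which are essential to many downstream video understanding tasks.
In this paper, we propose Motion Sensitive Contrastive Learning (MSCL), which injects motion information captured by optical flow into RGB frames to strengthen feature learning.
To this end, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL), which applies frame-level contrastive objectives across the two modalities.
We further introduce Flow Rotation Augmentation (FRA) to generate additional motion-shuffled negative samples, and Motion Differential Sampling (MDS) to screen training samples accurately.
Extensive experiments on standard benchmarks validate the effectiveness of the proposed method.
With the commonly used 3D ResNet-18 as the backbone, the method achieves top-1 accuracies of 91.5% on UCF101 and 50.3% on Something-Something v2 for video classification, and a top-1 recall of 65.6% on UCF101 for video retrieval, notably improving on the state of the art.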
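
The abstract names a frame-level contrastive objective across the RGB and optical-flow modalities but does not state the loss. The following is a minimal sketch of what such an objective could look like, assuming an InfoNCE-style loss in which the RGB feature of frame t is pulled toward the flow feature of the same frame and pushed away from the flow features of the other frames. The function names, the choice of negatives, and the temperature value are illustrative assumptions, not the paper's implementation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def frame_level_info_nce(rgb_frames, flow_frames, temperature=0.1):
    """InfoNCE over T frames: RGB frame t should match flow frame t.

    rgb_frames, flow_frames: lists of T feature vectors of equal dimension.
    Assumption: flow features at other time steps serve as negatives.
    """
    rgb = [l2_normalize(v) for v in rgb_frames]
    flow = [l2_normalize(v) for v in flow_frames]
    total = 0.0
    for t, anchor in enumerate(rgb):
        # Cosine similarity of this RGB frame to every flow frame.
        logits = [dot(anchor, f) / temperature for f in flow]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the positive pair (index t).
        total += log_denom - logits[t]
    return total / len(rgb)
```

With temporally aligned features the loss is close to zero, while swapping the flow frames in time makes it large, which is the behaviour one would want from an objective meant to be sensitive to short-term motion.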

Abstract (original)

Contrastive learning has shown great potential in video representation learning. However, existing approaches fail to sufficiently exploit short-term motion dynamics, which are crucial to various down-stream video understanding tasks. In this paper, we propose Motion Sensitive Contrastive Learning (MSCL) that injects the motion information captured by optical flows into RGB frames to strengthen feature learning. To achieve this, in addition to clip-level global contrastive learning, we develop Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities. Moreover, we introduce Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples and Motion Differential Sampling (MDS) to accurately screen training samples. Extensive experiments on standard benchmarks validate the effectiveness of the proposed method. With the commonly-used 3D ResNet-18 as the backbone, we achieve the top-1 accuracies of 91.5% on UCF101 and 50.3% on Something-Something v2 for video classification, and a 65.6% Top-1 Recall on UCF101 for video retrieval, notably improving the state-of-the-art.
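
The Top-1 Recall figure quoted for video retrieval is usually obtained by nearest-neighbour search over learned features. The sketch below shows one common way to compute it, assuming cosine similarity and a query/gallery split with class labels. The abstract does not describe the evaluation protocol, so the split, the similarity measure, and the function names here are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top1_recall(query_feats, query_labels, gallery_feats, gallery_labels):
    """Fraction of queries whose nearest gallery item shares their label."""
    hits = 0
    for q, q_label in zip(query_feats, query_labels):
        # Index of the most similar gallery feature for this query.
        best = max(range(len(gallery_feats)),
                   key=lambda i: cosine(q, gallery_feats[i]))
        hits += int(gallery_labels[best] == q_label)
    return hits / len(query_feats)
```

For example, with two gallery clips labelled "a" and "b" and three queries of which two land on the correct class, the function returns 2/3.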

arXiv information

Authors	Jingcheng Ni, Nan Zhou, Jie Qin, Qian Wu, Junqi Liu, Boxun Li, Di Huang
Published	2022-08-12 04:06:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
