Learning Video Representations without Natural Videos

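Summary

This paper shows that useful video representations can be learned from synthetic videos and natural images, without using any natural videos in training.
We propose a progression of video datasets, synthesized by simple generative processes, that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations).
The downstream performance of video models pre-trained on these generated datasets increases gradually along the dataset progression.
A VideoMAE model pre-trained on the synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training on natural videos, and it outperforms the pre-trained model on HMDB51.
Adding crops of static images to the pre-training stage yields performance similar to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 of the 14 out-of-distribution datasets of UCF101-P.
By analyzing low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance.
Our approach offers a more controllable and transparent alternative to video data curation processes for pre-training.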
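
The "share of the performance gap closed" figure quoted above can be read as a simple ratio: how far a model moves from the from-scratch baseline toward the natural-video pre-training reference. The sketch below is one plausible reading of that metric, not the authors' evaluation code; the function name `gap_closed` and its parameter names are hypothetical, and the accuracies in the usage line are placeholders, not results from the paper.

```python
def gap_closed(scratch_acc: float, candidate_acc: float, reference_acc: float) -> float:
    """Fraction of the scratch-to-reference accuracy gap recovered by the candidate.

    scratch_acc:   accuracy when training from scratch (lower baseline)
    candidate_acc: accuracy of the model under study (e.g. synthetic pre-training)
    reference_acc: accuracy of the reference model (e.g. natural-video pre-training)
    """
    gap = reference_acc - scratch_acc
    if gap <= 0:
        raise ValueError("reference accuracy must exceed the from-scratch accuracy")
    return (candidate_acc - scratch_acc) / gap


# Placeholder accuracies (percent), chosen only to exercise the function.
# They are NOT results from the paper.
example = gap_closed(scratch_acc=50.0, candidate_acc=80.0, reference_acc=90.0)
print(f"{example:.1%}")  # prints "75.0%"
```

Under this reading, a value of 1.0 means the candidate matches the reference, and a value above 1.0 means it exceeds it.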

Abstract (original)

In this paper, we show that useful video representations can be learned from synthetic videos and natural images, without incorporating natural videos in the training. We propose a progression of video datasets synthesized by simple generative processes, that model a growing set of natural video properties (e.g. motion, acceleration, and shape transformations). The downstream performance of video models pre-trained on these generated datasets gradually increases with the dataset progression. A VideoMAE model pre-trained on our synthetic videos closes 97.2% of the performance gap on UCF101 action classification between training from scratch and self-supervised pre-training from natural videos, and outperforms the pre-trained model on HMDB51. Introducing crops of static images to the pre-training stage results in similar performance to UCF101 pre-training and outperforms the UCF101 pre-trained model on 11 out of 14 out-of-distribution datasets of UCF101-P. Analyzing the low-level properties of the datasets, we identify correlations between frame diversity, frame similarity to natural data, and downstream performance. Our approach provides a more controllable and transparent alternative to video data curation processes for pre-training.

arXiv information

Authors	Xueyang Yu, Xinlei Chen, Yossi Gandelsman
Published	2024-10-31 17:59:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
