iBoot: Image-bootstrapped Self-Supervised Video Representation Learning

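Abstract

Learning visual representations through self-supervision is extremely challenging, because the network must sift relevant patterns from spurious distractors without the active guidance that supervision provides.
This is achieved through heavy data augmentation, large-scale datasets, and prohibitive amounts of compute.
Video self-supervised learning (SSL) faces additional challenges: video datasets are typically smaller than image datasets, compute is an order of magnitude larger, and the number of spurious patterns the optimizer has to sift through is multiplied several fold.
As a result, learning self-supervised representations directly from video data may yield sub-optimal performance.
To address this, we propose using a strong image-based model, pre-trained with self- or language supervision, within a video representation learning framework, so that the model can learn strong spatial and temporal information without relying on labeled video data.
To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to \textit{subsume} the semantic content of an image-based model trained on a general domain.
The proposed algorithm is shown to learn much more efficiently (i.e., in fewer epochs and with a smaller batch) and achieves new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.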

Abstract (original)

Learning visual representations through self-supervision is an extremely challenging task as the network needs to sieve relevant patterns from spurious distractors without the active guidance provided by supervision. This is achieved through heavy data augmentation, large-scale datasets and prohibitive amounts of compute. Video self-supervised learning (SSL) suffers from added challenges: video datasets are typically not as large as image datasets, compute is an order of magnitude larger, and the amount of spurious patterns the optimizer has to sieve through is multiplied several fold. Thus, directly learning self-supervised representations from video data might result in sub-optimal performance. To address this, we propose to utilize a strong image-based model, pre-trained with self- or language supervision, in a video representation learning framework, enabling the model to learn strong spatial and temporal information without relying on the video labeled data. To this end, we modify the typical video-based SSL design and objective to encourage the video encoder to \textit{subsume} the semantic content of an image-based model trained on a general domain. The proposed algorithm is shown to learn much more efficiently (i.e. in less epochs and with a smaller batch) and results in a new state-of-the-art performance on standard downstream tasks among single-modality SSL methods.
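
The abstract does not spell out the modified objective, so the following is only a minimal sketch of one way a video encoder could be encouraged to match the semantic content of a frozen image model. It assumes, purely for illustration, that the target is the mean of per-frame image-model features and that the loss is one minus cosine similarity. The names `mean_pool`, `cosine_similarity`, and `alignment_loss` are hypothetical and do not come from the paper.

```python
import math


def mean_pool(frame_features):
    """Average per-frame feature vectors into one clip-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def alignment_loss(video_embedding, frame_features):
    """Hypothetical objective (an assumption, not the paper's loss):
    1 - cos(video embedding, pooled image-model target)."""
    # The image model is treated as frozen, so the target is a constant.
    target = mean_pool(frame_features)
    return 1.0 - cosine_similarity(video_embedding, target)


# Toy stand-ins for frozen image-model outputs on two frames of a clip.
frame_features = [[1.0, 0.0], [0.0, 1.0]]

# A video embedding pointing along the pooled target gives a loss near 0;
# an orthogonal one gives a loss near 1.
aligned = alignment_loss([1.0, 1.0], frame_features)
misaligned = alignment_loss([1.0, -1.0], frame_features)
```

In a real training setup the embeddings would come from neural encoders and the loss would be minimized by gradient descent over the video encoder's parameters only; the toy vectors here just show how the quantity behaves.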

arXiv information

Authors	Fatemeh Saleh,Fuwen Tan,Adrian Bulat,Georgios Tzimiropoulos,Brais Martinez
Published	2022-06-16 17:42:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
