Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders

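Summary

Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and training the model with a reconstruction loss.
Contrastive self-supervised methods take a different route: they encourage two versions of the same input to have similar representations while pushing apart the representations of different inputs.
We propose ViC-MAE, a general method that combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and using the resulting global representation in a contrastive objective across video frames.
We show that visual representations learned with ViC-MAE generalize well to both video classification and image classification tasks.
Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over a recent previous work.
Moreover, our method maintains competitive transfer-learning performance on the Kinetics-400 video classification benchmark, with 81.50% top-1 accuracy.
In addition, we show that, despite its simplicity, ViC-MAE yields better results than combining MAE pre-training with previously proposed contrastive objectives such as VICReg and SimSiam.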
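
The combination described above (a reconstruction loss on masked patches plus a contrastive loss on pooled features from two video frames) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the pooling operator, the contrastive loss, or the loss weighting, so mean pooling, an InfoNCE-style loss, the `temperature` value, and the `contrastive_weight` parameter are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import numpy as np


def random_mask(batch, num_patches, mask_ratio, rng):
    """Boolean mask of shape (batch, num_patches); True marks a masked patch."""
    num_masked = int(round(num_patches * mask_ratio))
    # argsort of random noise yields an independent random permutation per row,
    # so exactly num_masked entries in each row fall below the threshold.
    ranks = rng.random((batch, num_patches)).argsort(axis=1)
    return ranks < num_masked


def reconstruction_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only (MAE-style)."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)  # (batch, num_patches)
    return float((per_patch * mask).sum() / max(mask.sum(), 1))


def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the contrastive objective).

    Row i of z_a and row i of z_b form the positive pair; all other rows
    of z_b act as negatives for row i of z_a.
    """
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def combined_loss(pred, target, mask, feats_a, feats_b, contrastive_weight=1.0):
    """Reconstruction loss plus a contrastive loss on pooled patch features.

    feats_a and feats_b have shape (batch, num_patches, dim) and come from two
    frames of the same video. Mean pooling is an assumption of this sketch.
    """
    z_a = feats_a.mean(axis=1)  # local patch features -> one global vector
    z_b = feats_b.mean(axis=1)
    recon = reconstruction_loss(pred, target, mask)
    return recon + contrastive_weight * info_nce(z_a, z_b)


# Toy usage with random arrays standing in for encoder and decoder outputs.
rng = np.random.default_rng(0)
mask = random_mask(batch=4, num_patches=196, mask_ratio=0.75, rng=rng)
target = rng.normal(size=(4, 196, 768))
pred = target + 0.1 * rng.normal(size=target.shape)
feats_a = rng.normal(size=(4, 196, 64))
feats_b = feats_a + 0.05 * rng.normal(size=feats_a.shape)
print(combined_loss(pred, target, mask, feats_a, feats_b))
```

In a real training setup the arrays would come from a ViT encoder and decoder and the loss would be differentiated by an autodiff framework; the sketch only shows how the two terms fit together.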

Abstract (original)

Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and a reconstruction loss. Alternatively, contrastive learning self-supervised methods encourage two versions of the same input to have a similar representation, while pulling apart the representations for different inputs. We propose ViC-MAE, a general method that combines both MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames. We show that visual representations learned under ViC-MAE generalize well to both video classification and image classification tasks. Using a backbone ViT-B/16 network pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on Imagenet-1k by improving 1.58% in absolute top-1 accuracy from a recent previous work. Moreover, our method maintains a competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives such as VicReg and SiamSiam.

arXiv information

Authors	Jefferson Hernandez, Ruben Villegas, Vicente Ordonez
Published	2023-03-21 16:33:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
