Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

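Summary

Because it needs no annotation, self-supervised learning has unlocked the potential to scale pretraining to billions of images.
But are we making the best use of the data?
How much more economical can we be?
This work attempts to answer that question with two contributions.
First, we investigate first-person videos and introduce the "Walking Tours" dataset.
These videos are high-resolution and hours long, captured in a single uninterrupted take, and depict a large number of objects and actions with natural scene transitions.
They are unlabeled and uncurated, which makes them realistic for self-supervision and comparable to human learning.
Second, we introduce a new self-supervised image pretraining method tailored to learning from continuous video.
Existing methods typically adapt image-based pretraining approaches to incorporate more frames.
Instead, we advocate a "tracking to learn to recognize" approach.
Our method, called DoRA, uses transformer cross-attention to produce attention maps that Discover and tRAck objects over time in an end-to-end manner.
We derive multiple views from the tracks and use them in a classical self-supervised distillation loss.
With this approach, a single Walking Tours video becomes a strong competitor to ImageNet on several image and video downstream tasks.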
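
The abstract does not spell out the distillation loss, so the following is only a minimal sketch of one common form: a student-teacher cross-entropy between temperature-scaled softmax outputs, averaged over pairs of different views. The function names, the temperature values, and the toy logits are illustrative assumptions and are not taken from the paper. In the method described above, the views would be derived from object tracks; here they are plain lists of logits, and refinements such as teacher centering are omitted.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_views, teacher_views,
                      student_temp=0.1, teacher_temp=0.04):
    """Mean cross-entropy H(teacher, student) over pairs of different views.

    student_views / teacher_views: one list of logits per view.
    The temperatures are assumed values chosen for illustration.
    """
    total, pairs = 0.0, 0
    for i, t_logits in enumerate(teacher_views):
        # Teacher distribution, sharpened by a low temperature.
        p_teacher = [math.exp(v) for v in log_softmax(t_logits, teacher_temp)]
        for j, s_logits in enumerate(student_views):
            if i == j:
                continue  # only compare different views of the same content
            log_p_student = log_softmax(s_logits, student_temp)
            total -= sum(pt * ls for pt, ls in zip(p_teacher, log_p_student))
            pairs += 1
    if pairs == 0:
        raise ValueError("need at least two views")
    return total / pairs


# Toy logits for two views of the same content (hypothetical numbers).
teacher_views = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
student_agree = [[2.0, 0.0, -1.0], [1.5, 0.2, -0.5]]
student_disagree = [[-1.0, 0.0, 2.0], [-0.5, 0.2, 1.5]]

loss_agree = distillation_loss(student_agree, teacher_views)
loss_disagree = distillation_loss(student_disagree, teacher_views)
```

In this sketch the loss is small when the student's prediction for one view matches the teacher's prediction for another view, and large when they disagree, which is what pushes the encoder toward view-invariant features.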

Summary (original)

Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a ‘Walking Tours’ dataset. These videos are high-resolution, hours-long, captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a ‘tracking to learn to recognize’ approach. Our method, called DoRA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.

arXiv information

Authors	Shashanka Venkataramanan,Mamshad Nayeem Rizve,João Carreira,Yuki M. Asano,Yannis Avrithis
Published	2023-10-12 17:59:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
