Unmasked Teacher: Towards Training-Efficient Video Foundation Models

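Summary

Video Foundation Models (VFMs) remain underexplored because of high computational cost and scarce data.
Previous VFMs rely on Image Foundation Models (IFMs), which are hard to transfer to the video domain.
VideoMAE trains a robust ViT from limited data, but its low-level reconstruction makes convergence difficult and conflicts with high-level cross-modal alignment.
This paper proposes a training-efficient method for temporally sensitive VFMs that combines the benefits of existing methods.
To improve data efficiency, most low-semantics video tokens are masked out, and the unmasked tokens are selectively aligned with an IFM, which serves as the UnMasked Teacher (UMT).
This semantic guidance enables faster convergence and makes the model multimodal-friendly.
With a progressive pre-training framework, the model handles a range of tasks, including scene-related, temporal-related, and complex video-language understanding.
Pre-trained for 6 days on 32 A100 GPUs using only public sources, the ViT-L/16 built from scratch achieves state-of-the-art performance on various video tasks.
The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.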
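
The masking-and-alignment idea in the summary can be illustrated with a small sketch. This is not the authors' implementation: it assumes each video token already has a scalar "semantic importance" score (for example, derived from the teacher), keeps only the top-scoring fraction as unmasked tokens, and measures the mean squared error between L2-normalized student and teacher features on those tokens alone. The function names `select_unmasked` and `alignment_loss`, the `keep_ratio` parameter, and the toy data are all hypothetical.

```python
import math


def select_unmasked(scores, keep_ratio):
    """Return sorted indices of the tokens to keep (unmasked).

    Assumption: a higher score means a more semantically useful token.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def alignment_loss(student_feats, teacher_feats, kept):
    """MSE between normalized student and teacher features on kept tokens only.

    student_feats[j] is the student's output for token kept[j];
    teacher_feats is indexed by the original token position.
    """
    total, count = 0.0, 0
    for j, token_idx in enumerate(kept):
        s = l2_normalize(student_feats[j])
        t = l2_normalize(teacher_feats[token_idx])
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(s)
    return total / count


# Toy example: 8 tokens with hypothetical importance scores, keep 25% of them.
scores = [0.05, 0.40, 0.10, 0.90, 0.02, 0.30, 0.70, 0.01]
kept = select_unmasked(scores, keep_ratio=0.25)
print(kept)  # [3, 6]

teacher_feats = [[float(i + 1), 1.0] for i in range(8)]
student_feats = [[1.0, 0.0], [0.0, 1.0]]  # student sees only the kept tokens
print(alignment_loss(student_feats, teacher_feats, kept))
```

Because the student processes only the kept tokens, the cost per training step falls roughly in proportion to the masking ratio, while the teacher's features still supply a high-level target rather than raw pixels.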

Summary (original)

Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performances on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher.

arXiv information

Authors	Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, Yu Qiao
Published	2023-03-28 15:39:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
