InternVideo: General Video Foundation Models via Generative and Discriminative Learning

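Abstract

Foundation models have recently shown strong performance on a wide range of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic, complex video-level understanding tasks. To fill this gap, we present InternVideo, a set of general video foundation models that draws on both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. InternVideo achieves state-of-the-art performance on 39 video datasets spanning a broad set of tasks, including video action recognition and detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. Together, these results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo.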
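As an illustration of the video-language contrastive objective mentioned in the abstract, the sketch below shows one common form of such a loss, a symmetric InfoNCE over paired video and text embeddings. The abstract does not specify the exact loss, so the formulation, the function names, and the temperature of 0.07 are all assumptions made for this example, not InternVideo's actual implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: video i and text i form the only positive pair.

    The temperature value is an assumed default, not taken from the paper.
    """
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    n = len(videos)
    # sim[i][j] is the temperature-scaled cosine similarity of video i and text j.
    sim = [[sum(a * b for a, b in zip(v, t)) / temperature for t in texts]
           for v in videos]
    # Video-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


# Toy embeddings: correctly paired versus deliberately mismatched.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                      [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs, the loss for the correctly paired batch should be far smaller than the loss for the mismatched batch, which is the behavior a contrastive pretraining objective relies on.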

Abstract (original)

The foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. Especially, our methods can obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .

arXiv information

Authors	Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, Yu Qiao
Published	2022-12-06 18:09:49+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
