InternVideo2: Scaling Foundation Models for Multimodal Video Understanding

Summary

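We introduce InternVideo2, a new family of video foundation models (ViFM) that achieves state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions.
This improves the alignment between video and text.
Through extensive experiments, we validate our design and demonstrate superior performance on over 60 video and audio tasks.
Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason over and comprehend longer contexts.
Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.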

Summary (original)

We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.

arXiv information

Authors	Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, Limin Wang
Published	2024-08-14 14:31:50+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google

