Apollo: An Exploration of Video Understanding in Large Multimodal Models

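Summary

Video perception capabilities are rapidly being integrated into Large Multimodal Models (LMMs), yet the underlying mechanisms that drive video understanding remain poorly understood.
As a result, many design decisions in this field are made without proper justification or analysis.
The high computational cost of training and evaluating such models, combined with limited open research, hinders the development of video-LMMs.
To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs.
We begin by critically examining the main contributors to the high computational requirements of video-LMM research and discover Scaling Consistency, whereby design and training decisions made on smaller models and datasets (up to a critical size) transfer effectively to larger models.
Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, and training schedules.
For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling, and we identified which vision encoders are best suited to video representation.
Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieves superior performance across different model sizes.
Our models can perceive hour-long videos efficiently, with Apollo-3B scoring an impressive 55.1 on LongVideoBench and outperforming most existing 7B models.
Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.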

Abstract (original)

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.
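
The abstract contrasts fps sampling with uniform frame sampling but does not spell out either procedure. The following is a minimal sketch of the general idea only, not the authors' implementation: the function names, the midpoint rule for uniform sampling, and the example values (30 fps clips, 8 uniform frames, a 2 fps target) are all assumptions chosen for illustration.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Uniform sampling: a fixed number of frames spread over the whole clip.

    The time gap between sampled frames grows with the clip length.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the midpoint of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def fps_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """fps sampling: one frame every 1 / target_fps seconds.

    The time gap between sampled frames is constant; the number of
    sampled frames grows with the clip length instead.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never sample more densely than the video's own frame rate.
    stride = max(1.0, native_fps / target_fps)
    indices = []
    i = 0
    while int(i * stride) < total_frames:
        indices.append(int(i * stride))
        i += 1
    return indices


if __name__ == "__main__":
    # Hypothetical clips: 10 s and 60 s, both recorded at 30 fps.
    for seconds in (10, 60):
        frames = seconds * 30
        uniform = uniform_frame_indices(frames, num_samples=8)
        by_fps = fps_frame_indices(frames, native_fps=30.0, target_fps=2.0)
        print(seconds, "s:", len(uniform), "uniform frames,", len(by_fps), "fps-sampled frames")
```

In this sketch, uniform sampling always returns 8 frames, so the gap between consecutive frames stretches from about 1.25 s on the 10-second clip to 7.5 s on the 60-second clip. fps sampling keeps the gap at 0.5 s and returns more frames for the longer clip, so the apparent speed of motion stays consistent across clip lengths.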

arXiv information

Authors	Orr Zohar,Xiaohan Wang,Yann Dubois,Nikhil Mehta,Tong Xiao,Philippe Hansen-Estruch,Licheng Yu,Xiaofang Wang,Felix Juefei-Xu,Ning Zhang,Serena Yeung-Levy,Xide Xia
Published	2024-12-13 18:53:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
