VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Summary

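This paper introduces VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to strengthen spatial-temporal modeling and audio understanding in video- and audio-oriented tasks.
Building on its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector that effectively captures the complex spatial and temporal dynamics of video data.
In addition, an audio branch is integrated into the model through joint training, which enriches the model's multimodal understanding by seamlessly incorporating audio cues.
Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks show that VideoLLaMA 2 consistently achieves competitive results among open-source models and even comes close to some proprietary models on several benchmarks.
Furthermore, VideoLLaMA 2 shows reasonable improvements over existing models on audio-only and audio-video question answering (AQA and OE-AVQA) benchmarks.
These advances underline VideoLLaMA 2's superior performance in multimodal understanding and set a new standard for intelligent video analysis systems.
All models are publicly available to facilitate further research.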

Summary (original)

In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2’s superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.

arXiv information

Authors	Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing
Published	2024-06-11 17:22:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
