Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

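Summary

Current Multi-modal Large Language Models (MLLMs) show promising results in video understanding, but processing extremely long videos remains a challenge.
MLLMs typically struggle when the thousands of visual tokens exceed the maximum context length of the LLM, and aggregating tokens to fit reduces visual clarity.
Another challenge is the high computational cost caused by the large number of video tokens.
To address these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding.
Specifically, we argue that LLMs can be adapted as effective visual condensers, and we introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms.
Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data.
Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100% accuracy in the Needle-in-a-Haystack evaluation.
We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and ad placement identification.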
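
The context-length problem described above comes down to simple arithmetic, which the following minimal sketch illustrates. Only the 1024-frame figure comes from the abstract. The function name `visual_token_budget`, the 144 tokens per frame, the 32768-token context window, and the compression ratio of 16 are all hypothetical values chosen for illustration; they are not taken from the paper and do not describe how Video-XL actually condenses tokens.

```python
def visual_token_budget(num_frames, tokens_per_frame, context_length,
                        compression_ratio=1):
    """Return (raw_tokens, condensed_tokens, fits) for a video input.

    All parameters are illustrative assumptions, not Video-XL settings.
    """
    raw_tokens = num_frames * tokens_per_frame
    # Ceiling division: a partial group of tokens still needs one slot.
    condensed_tokens = -(-raw_tokens // compression_ratio)
    fits = condensed_tokens <= context_length
    return raw_tokens, condensed_tokens, fits


# 1024 frames (from the abstract) at an assumed 144 tokens per frame,
# checked against an assumed 32768-token context window.
uncompressed = visual_token_budget(1024, 144, 32768)
compressed = visual_token_budget(1024, 144, 32768, compression_ratio=16)
print(uncompressed)
print(compressed)
```

Under these assumed numbers the uncompressed visual tokens overflow the context window several times over, while a 16x condensation brings them within budget, which is the kind of gap a visual condenser is meant to close.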

Abstract (original)

Although current Multi-modal Large Language Models (MLLMs) demonstrate promising results in video understanding, processing extremely long videos remains an ongoing challenge. Typically, MLLMs struggle with handling thousands of tokens that exceed the maximum context length of LLMs, and they experience reduced visual clarity due to token aggregation. Another challenge is the high computational cost stemming from the large number of video tokens. To tackle these issues, we propose Video-XL, an extra-long vision language model designed for efficient hour-scale video understanding. Specifically, we argue that LLMs can be adapted as effective visual condensers and introduce Visual Context Latent Summarization, which condenses visual contexts into highly compact forms. Extensive experiments demonstrate that our model achieves promising results on popular long video understanding benchmarks, despite being trained on limited image data. Moreover, Video-XL strikes a promising balance between efficiency and effectiveness, processing 1024 frames on a single 80GB GPU while achieving nearly 100% accuracy in the Needle-in-a-Haystack evaluation. We envision Video-XL becoming a valuable tool for long video applications such as video summarization, surveillance anomaly detection, and ad placement identification.

arXiv information

Authors	Yan Shu, Peitian Zhang, Zheng Liu, Minghao Qin, Junjie Zhou, Tiejun Huang, Bo Zhao
Published	2024-09-24 14:59:30+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
