VideoLLM-online: Online Video Large Language Model for Streaming Video

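Summary

Recent large language models have been extended with vision capabilities, allowing them to understand images, videos, and interleaved vision-language content.
However, the training methods for these large multimodal models typically treat videos as predetermined clips, which makes them less effective and less efficient on streaming video input.
In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework that enables temporally aligned, long-context, real-time conversation within a continuous video stream.
Our LIVE framework takes a comprehensive approach to video streaming dialogue, comprising (1) a training objective designed to perform language modeling over continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline that speeds up model responses on real-world video streams.
Using the LIVE framework, we built the VideoLLM-online model on Llama-2/Llama-3 and demonstrated its significant advantages in processing streaming video.
For example, on average our model supports streaming dialogue over a 5-minute video clip at more than 10 FPS on an A100 GPU.
It also achieves state-of-the-art performance on public offline video benchmarks covering recognition, captioning, and forecasting.
The code, model, data, and demo are available at https://showlab.github.io/videollm-online.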
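
As a rough illustration of component (2), converting offline temporal annotations into a streaming dialogue format, the sketch below interleaves sampled frames with assistant turns. It is not the authors' implementation: the `Annotation` class, the `to_streaming_dialogue` function, the sampling rate, and the rule of emitting each narration right after the first sampled frame at or after its start time are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Annotation:
    """Hypothetical offline temporal annotation: a narration tied to a time span."""
    start: float  # seconds
    end: float    # seconds (carried along, but unused in this sketch)
    text: str


Event = Tuple[str, Union[float, str]]


def to_streaming_dialogue(
    annotations: List[Annotation],
    duration: float,
    fps: float = 2.0,
) -> List[Event]:
    """Interleave sampled frames with assistant turns.

    Every sampled frame becomes a ("frame", timestamp) event. Each annotation
    becomes an ("assistant", text) event placed right after the first sampled
    frame whose timestamp is at or after the annotation's start time.
    Annotations that start after the last sampled frame are dropped.
    """
    pending = sorted(annotations, key=lambda a: a.start)
    events: List[Event] = []
    num_frames = int(duration * fps)
    i = 0
    for k in range(num_frames):
        t = k / fps
        events.append(("frame", t))
        # Emit every narration whose start time has now been reached.
        while i < len(pending) and pending[i].start <= t:
            events.append(("assistant", pending[i].text))
            i += 1
    return events


if __name__ == "__main__":
    anns = [
        Annotation(0.6, 2.0, "picks up knife"),
        Annotation(1.5, 3.0, "cuts onion"),
    ]
    for event in to_streaming_dialogue(anns, duration=2.0, fps=2.0):
        print(event)
```

In this toy layout, frames with no following assistant event are the moments where a streaming model would be expected to stay silent, which is the situation a clip-based training format does not represent.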

Summary (original)

Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.

arXiv information

Authors	Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou
Published	2024-06-17 17:55:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
