StreamChat: Chatting with Streaming Video

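Summary

This paper presents StreamChat, a new approach that improves how Large Multimodal Models (LMMs) interact with streaming video content.
In streaming interaction scenarios, existing methods rely solely on the visual information available at the moment a question is asked, so the model remains unaware of subsequent changes in the streaming video, which causes significant delays.
StreamChat addresses this limitation by updating the visual context at every decoding step, so the model uses the latest video content throughout the decoding process.
We also introduce a flexible, efficient cross-attention-based architecture that processes dynamic streaming inputs while preserving inference efficiency for streaming interaction.
In addition, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens.
Experimental results show that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.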

Abstract (original)

This paper presents StreamChat, a novel approach that enhances the interaction capabilities of Large Multimodal Models (LMMs) with streaming video content. In streaming interaction scenarios, existing methods rely solely on visual information available at the moment a question is posed, resulting in significant delays as the model remains unaware of subsequent changes in the streaming video. StreamChat addresses this limitation by innovatively updating the visual context at each decoding step, ensuring that the model utilizes up-to-date video content throughout the decoding process. Additionally, we introduce a flexible and efficient cross-attention-based architecture to process dynamic streaming inputs while maintaining inference efficiency for streaming interactions. Furthermore, we construct a new dense instruction dataset to facilitate the training of streaming interaction models, complemented by a parallel 3D-RoPE mechanism that encodes the relative temporal information of visual and text tokens. Experimental results demonstrate that StreamChat achieves competitive performance on established image and video benchmarks and exhibits superior capabilities in streaming interaction scenarios compared to state-of-the-art video LMMs.
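
The difference between answering from a fixed snapshot and refreshing the visual context at each decoding step can be illustrated with a toy loop. This is only a sketch under stated assumptions, not the StreamChat implementation: the names `decode_static`, `decode_streaming`, `toy_stream`, and `toy_next_token` are hypothetical, frames are plain strings, and the "model" simply reports the newest frame it can see.

```python
from typing import Callable, List, Sequence

Frames = Sequence[str]
Stream = Callable[[int], Frames]                    # hypothetical: frames visible at a step
NextToken = Callable[[Sequence[str], Frames], str]  # hypothetical: stand-in for the model


def decode_static(stream: Stream, next_token: NextToken, max_steps: int) -> List[str]:
    """Baseline: capture the visual context once, when the question is asked."""
    frames = stream(0)  # snapshot at question time, never refreshed
    tokens: List[str] = []
    for _ in range(max_steps):
        tokens.append(next_token(tokens, frames))
    return tokens


def decode_streaming(stream: Stream, next_token: NextToken, max_steps: int) -> List[str]:
    """Streaming variant: re-read the visual context before every decoding step."""
    tokens: List[str] = []
    for step in range(max_steps):
        frames = stream(step)  # latest frames available at this step
        tokens.append(next_token(tokens, frames))
    return tokens


def toy_stream(step: int) -> Frames:
    """Toy video: one new frame arrives per decoding step (an assumption)."""
    return [f"frame{t}" for t in range(step + 1)]


def toy_next_token(tokens: Sequence[str], frames: Frames) -> str:
    """Toy model: emits a token naming the most recent frame it was given."""
    return f"saw:{frames[-1]}"


if __name__ == "__main__":
    print(decode_static(toy_stream, toy_next_token, 3))
    print(decode_streaming(toy_stream, toy_next_token, 3))
```

In this toy setup the static loop keeps describing the first frame, while the streaming loop tracks each newly arrived frame. How a real model consumes the refreshed frames (here, the cross-attention-based architecture named in the abstract) is outside what this sketch covers.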

arXiv information

Authors	Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, Jose M. Alvare
Published	2024-12-11 18:59:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
