TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

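Summary

This work proposes TimeChat, a time-sensitive multimodal large language model designed specifically for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content to the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying length to accommodate videos of various durations.
In addition, we construct an instruction-tuning dataset covering 6 tasks and a total of 125K instances to further enhance TimeChat's instruction-following performance.
Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities.
For example, compared with state-of-the-art video large language models, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA.
TimeChat therefore has the potential to serve as a versatile video assistant for long-form video comprehension tasks and to satisfy realistic user requirements.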
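
The R@1 (IoU=0.5) figure quoted for Charades-STA is a temporal grounding metric: a query counts as a hit when the top-ranked predicted segment overlaps the ground-truth segment with an intersection-over-union of at least 0.5. The following is a minimal sketch of that computation, not the authors' evaluation code; the function names `temporal_iou` and `recall_at_1` and the `(start, end)` tuple layout in seconds are assumptions made for illustration.

```python
def temporal_iou(pred, gt):
    """IoU of two (start, end) segments given in seconds (assumed layout)."""
    # Overlap is clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold=0.5):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(preds) != len(gts) or not gts:
        raise ValueError("preds and gts must be non-empty and the same length")
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

Under this definition, a gain of +27.5 R@1 means 27.5 percentage points more queries have a top-1 segment that clears the threshold.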

Abstract (original)

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat’s instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat’s strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models, holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.
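
The abstract says the sliding video Q-Former yields a token sequence whose length varies with video duration. One way to picture this is a fixed-size window that slides over the sampled frames, with each window contributing a fixed number of tokens. The sketch below only counts windows and tokens; it is not the authors' implementation, and the window length, stride, and tokens-per-window values are hypothetical parameters chosen for illustration.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame-index spans, end exclusive, covering all frames."""
    if num_frames <= 0 or window <= 0 or not 0 < stride <= window:
        raise ValueError("need num_frames > 0 and 0 < stride <= window")
    spans = []
    start = 0
    while True:
        # The last window is truncated at the end of the video.
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


def video_token_count(num_frames, window, stride, tokens_per_window):
    """Total video tokens if every window emits tokens_per_window tokens (assumed)."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window
```

With these assumed settings, a longer video produces more windows and therefore a longer token sequence, while a short clip that fits in one window produces a single window's worth of tokens.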

arXiv information

Authors	Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
Published	2024-03-28 12:41:14+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
