Learning Video Context as Interleaved Multimodal Sequences

Summary

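Narrative videos such as movies pose significant challenges for video understanding because they combine rich context (characters, dialogue, storylines) with diverse demands (identifying who is involved, how they are related, and why events happen).
In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video context.
Our core idea is to represent a video as an interleaved multimodal sequence (including images, plots, videos, and subtitles), built either by linking external knowledge databases or by using offline models (such as Whisper for subtitles).
Through instruction tuning, this approach enables the language model to interact with videos using interleaved multimodal instructions.
For example, instead of relying solely on video as input, we jointly provide character photos alongside their names and dialogue, allowing the model to associate these elements and generate more comprehensive responses.
To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question answering).
The code will be made public at https://github.com/showlab/MovieSeq.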
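
To make the idea of an interleaved multimodal sequence concrete, here is a minimal Python sketch. It is not taken from the MovieSeq code base: the class names (`Text`, `Image`, `Video`), the functions `build_sequence` and `render`, the file paths, and the `<image>` / `<video>` placeholder tokens are all hypothetical, chosen only to illustrate how character photos, names, a clip, subtitles, and a question could be arranged into one ordered sequence.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Text:
    content: str


@dataclass
class Image:
    path: str


@dataclass
class Video:
    path: str


# One element of the interleaved sequence (hypothetical type alias).
Segment = Union[Text, Image, Video]


def build_sequence(
    characters: List[Tuple[str, str]],   # (name, photo path) pairs
    clip_path: str,
    subtitles: List[Tuple[str, str]],    # (speaker, line) pairs
    question: str,
) -> List[Segment]:
    """Interleave character photos and names, the clip, subtitles, and a question."""
    seq: List[Segment] = []
    for name, photo in characters:
        seq.append(Image(photo))
        seq.append(Text(f"This is {name}."))
    seq.append(Video(clip_path))
    for speaker, line in subtitles:
        seq.append(Text(f"{speaker}: {line}"))
    seq.append(Text(question))
    return seq


def render(seq: List[Segment]) -> str:
    """Flatten the sequence into a prompt string with placeholder tokens.

    A real model would swap each placeholder for visual features; the
    token names used here are assumptions for illustration only.
    """
    parts = []
    for seg in seq:
        if isinstance(seg, Text):
            parts.append(seg.content)
        elif isinstance(seg, Image):
            parts.append("<image>")
        else:
            parts.append("<video>")
    return " ".join(parts)


seq = build_sequence(
    characters=[("Alice", "alice.jpg")],
    clip_path="clip.mp4",
    subtitles=[("Alice", "Hello.")],
    question="Who is speaking?",
)
prompt = render(seq)
```

In this sketch the character photo and name come before the clip so that a model could, in principle, link the face to the name before it reads the subtitles. The ordering is a design choice made for this example, not something the abstract specifies.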

Abstract (original)

Narrative videos, such as movies, pose significant challenges in video understanding due to their rich contexts (characters, dialogues, storylines) and diverse demands (identify who, relationship, and reason). In this paper, we introduce MovieSeq, a multimodal language model developed to address the wide range of challenges in understanding video contexts. Our core idea is to represent videos as interleaved multimodal sequences (including images, plots, videos, and subtitles), either by linking external knowledge databases or using offline models (such as whisper for subtitles). Through instruction-tuning, this approach empowers the language model to interact with videos using interleaved multimodal instructions. For example, instead of solely relying on video as input, we jointly provide character photos alongside their names and dialogues, allowing the model to associate these elements and generate more comprehensive responses. To demonstrate its effectiveness, we validate MovieSeq’s performance on six datasets (LVU, MAD, Movienet, CMD, TVC, MovieQA) across five settings (video classification, audio description, video-text retrieval, video captioning, and video question-answering). The code will be public at https://github.com/showlab/MovieSeq.

arXiv information

Authors	Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou
Published	2024-09-12 14:01:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
