VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Abstract

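Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary.
Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies of existing generative video-language models.
Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model that combines the language model's complex reasoning capabilities with the flexible upgrading of contrastive retrieval over the narration vocabulary.
(ii) A hierarchical vocabulary derived from large-scale video narrations using a narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand).
(iii) A vocabulary update strategy that leverages generative models to extend the vocabulary for novel events encountered during inference.
To validate the approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after).
Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, and offering a novel perspective on video understanding.
Code is released at https://github.com/showlab/VLog.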

Abstract (original)

Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce VLog, a novel video understanding framework that defines video narrations as vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, VLog features three key innovations: (i) A generative retrieval model, marrying language model’s complex reasoning capabilities with contrastive retrieval’s flexible upgrading over narration vocabulary. (ii) A hierarchical vocabulary derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand). (iii) A vocabulary update strategy leveraging generative models to extend the vocabulary for novel events encountered during inference. To validate our approach, we introduce VidCap-Eval, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on EgoSchema, COIN, and HiREST further demonstrate the effectiveness of VLog, highlighting its ability to generate concise, contextually accurate, and efficient narrations, offering a novel perspective on video understanding. Code is released at https://github.com/showlab/VLog.
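
The idea of indexing a specific event by first identifying its broader scenario, and of extending the vocabulary when a new event appears, can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary contents, the 3-dimensional embeddings, the centroid-based scenario selection, and the function names `retrieve` and `update` are all assumptions made for illustration only.

```python
import math

# Hypothetical toy vocabulary: scenario -> {narration: embedding}.
# Real embeddings would come from a trained model; these are made up.
VOCAB = {
    "kitchen": {
        "cutting a tomato": [0.9, 0.1, 0.0],
        "washing a pan": [0.7, 0.6, 0.1],
    },
    "bedroom": {
        "turning off an alarm": [0.0, 0.2, 0.9],
        "making the bed": [0.1, 0.8, 0.5],
    },
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def retrieve(query, vocab):
    """Two-stage lookup: choose a scenario, then a narration inside it."""
    # Stage 1 (assumed): the scenario whose centroid is closest to the query.
    scenario = max(
        vocab,
        key=lambda s: cosine(query, centroid(list(vocab[s].values()))),
    )
    # Stage 2: the closest narration within that scenario only.
    narration = max(
        vocab[scenario],
        key=lambda n: cosine(query, vocab[scenario][n]),
    )
    return scenario, narration


def update(vocab, scenario, narration, embedding):
    """Add a new narration entry, creating the scenario if needed."""
    vocab.setdefault(scenario, {})[narration] = list(embedding)
```

Restricting the second stage to one scenario means only a fraction of the vocabulary is compared against the query, which is the efficiency argument behind a hierarchical index. In this sketch the new entry passed to `update` is supplied by the caller; the abstract states that VLog uses generative models to propose such entries.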

arXiv information

Authors	Kevin Qinghong Lin, Mike Zheng Shou
Published	2025-06-09 16:24:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
