CoS: Chain-of-Shot Prompting for Long Video Understanding

Summary

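Multi-modal large language models (MLLMs) struggle with long videos because such videos require an excessive number of visual tokens.
These tokens far exceed the context length of an MLLM, so the context ends up filled with redundant, task-irrelevant shots.
How to select shots remains an important open problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content and leads to misunderstanding of the video.
To solve this problem, we propose Chain-of-Shot prompting (CoS).
The key idea is to frame shot selection as test-time visual prompt optimisation, selecting shots adapted to the semantic task of video understanding by optimising shot-task alignment.
CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding and discovers a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots.
The optimised shot selections are embedded into the original video, helping the model focus on relevant context and improving long video understanding.
Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS.
Code is available at https://lwpyh.github.io/CoS.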
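As a rough illustration of what a binary coding over shots might look like, the sketch below thresholds per-shot relevance scores into 0/1 codes and splits the shots into positive and negative sets. This is not the authors' implementation: the function names `binary_shot_coding` and `split_shots`, the use of numeric relevance scores, and the threshold are all assumptions made for illustration, and the abstract does not say how CoS actually derives its coding.

```python
from typing import List, Sequence, Tuple


def binary_shot_coding(scores: Sequence[float], threshold: float) -> List[int]:
    """Map each shot's relevance score to 1 (task-relevant) or 0 (irrelevant).

    Hypothetical helper: the scores and threshold are illustrative assumptions,
    not something specified by the CoS abstract.
    """
    return [1 if score >= threshold else 0 for score in scores]


def split_shots(
    shots: Sequence[str], coding: Sequence[int]
) -> Tuple[List[str], List[str]]:
    """Split shots into (positive, negative) lists according to a binary coding."""
    if len(shots) != len(coding):
        raise ValueError("shots and coding must have the same length")
    positive = [shot for shot, bit in zip(shots, coding) if bit == 1]
    negative = [shot for shot, bit in zip(shots, coding) if bit == 0]
    return positive, negative


# Example with made-up shot identifiers and scores.
example_shots = ["shot_0", "shot_1", "shot_2", "shot_3"]
example_scores = [0.9, 0.1, 0.6, 0.3]
example_coding = binary_shot_coding(example_scores, threshold=0.5)
example_positive, example_negative = split_shots(example_shots, example_coding)
```

In this toy setup the positive list would be passed on as the task-relevant context and the negative list kept as contrasting material; how the two sets are actually paired in CoS is described only at a high level in the abstract.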

Summary (original)

Multi-modal Large Language Models (MLLMs) struggle with long videos due to the need for excessive visual tokens. These tokens exceed massively the context length of MLLMs, resulting in filled by redundant task-irrelevant shots. How to select shots is an unsolved critical problem: sparse sampling risks missing key details, while exhaustive sampling overwhelms the model with irrelevant content, leading to video misunderstanding. To solve this problem, we propose Chain-of-Shot prompting (CoS). The key idea is to frame shot selection as test-time visual prompt optimisation, choosing shots adaptive to video understanding semantic task by optimising shots-task alignment. CoS has two key parts: (1) a binary video summary mechanism that performs pseudo temporal grounding, discovering a binary coding to identify task-relevant shots, and (2) a video co-reasoning module that deploys the binary coding to pair (learning to align) task-relevant positive shots with irrelevant negative shots. It embeds the optimised shot selections into the original video, facilitating a focus on relevant context to optimize long video understanding. Experiments across three baselines and five datasets demonstrate the effectiveness and adaptability of CoS. Code given in https://lwpyh.github.io/CoS.

arXiv information

Authors	Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong
Published	2025-02-11 14:59:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
