Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

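Summary

Despite exciting recent results showing that vision-language systems can reason about images using natural language, their ability to reason about video remains under-explored.
We motivate framing video reasoning as the sequential understanding of a small number of keyframes, which leverages the power and robustness of vision-language models while reducing the computational cost of processing video.
To evaluate this new application, we introduce VIP, an inference-time challenge dataset designed to probe models' reasoning capabilities through video chain-of-thought.
Inspired by visually descriptive scene plays, we propose two formats for keyframe descriptions: unstructured dense captions, and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe.
To evaluate video reasoning, we propose two tasks, Video Infilling and Video Prediction, which test the ability to generate multiple intermediate keyframes and to predict future keyframes, respectively.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap on these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.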

Summary (original)

Despite exciting recent results showing vision-language systems’ capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential understanding of a small number of keyframes, thereby leveraging the power and robustness of vision-language while alleviating the computational complexities of processing videos. To evaluate this novel application, we introduce VIP, an inference-time challenge dataset designed to explore models’ reasoning capabilities through video chain-of-thought. Inspired by visually descriptive scene plays, we propose two formats for keyframe description: unstructured dense captions and structured scene descriptions that identify the focus, action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video reasoning, we propose two tasks: Video Infilling and Video Prediction, which test abilities to generate multiple intermediate keyframes and predict future keyframes, respectively. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in these complex video reasoning tasks, and encourage future work to prioritize language models for efficient and generalized video reasoning.
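
As an illustration of the two keyframe tasks described above, the following is a minimal sketch, not the dataset's actual schema or evaluation code. The class name `FamousScene`, the helper functions `infilling_task` and `prediction_task`, and the way context frames are chosen are all assumptions made for this example; only the five FAMOuS field names come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FamousScene:
    """Hypothetical container for one structured keyframe description."""
    focus: str
    action: str
    mood: str
    objects: Tuple[str, ...]
    setting: str


def infilling_task(keyframes: List[FamousScene], start: int, gap: int):
    """Hide `gap` consecutive keyframes beginning at index `start`.

    Returns (frames_before, frames_after, hidden_targets). A model would be
    asked to generate the hidden targets from the surrounding context.
    """
    if gap < 1 or start < 1 or start + gap >= len(keyframes):
        raise ValueError("need at least one keyframe on each side of the gap")
    return keyframes[:start], keyframes[start + gap:], keyframes[start:start + gap]


def prediction_task(keyframes: List[FamousScene], n_context: int, n_future: int):
    """Split a sequence into the first `n_context` frames and the next `n_future`.

    Returns (context, future_targets). A model would be asked to predict the
    future targets from the context alone.
    """
    if n_context < 1 or n_future < 1 or n_context + n_future > len(keyframes):
        raise ValueError("not enough keyframes for the requested split")
    return keyframes[:n_context], keyframes[n_context:n_context + n_future]


# Toy sequence of five keyframes (invented content, for illustration only).
frames = [
    FamousScene(f"person {i}", "walking", "calm", ("bag",), "street")
    for i in range(5)
]
before, after, hidden = infilling_task(frames, start=1, gap=3)
context, future = prediction_task(frames, n_context=2, n_future=3)
```

In this sketch, infilling keeps context on both sides of the hidden span, while prediction only sees the frames that come first; the number of context and target frames used in VIP itself is not stated in the abstract.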

arXiv information

Authors	Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
Published	2023-11-09 06:50:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
