VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

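Summary

Advances in Multimodal Large Language Models (MLLMs) have driven significant progress in multimodal understanding and expanded their capacity to analyze video content.
However, existing evaluation benchmarks for MLLMs focus mainly on abstract video comprehension. They lack a detailed assessment of the ability to understand video composition, that is, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts.
VidComposition is a new benchmark designed specifically to evaluate the video composition understanding of MLLMs, using carefully curated compiled videos and cinematic-level annotations.
VidComposition contains 982 videos with 1706 multiple-choice questions covering a range of compositional aspects, including camera movement, angle, shot size, narrative structure, and character actions and emotions. A comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities.
This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement.
The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.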
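
Because the benchmark consists of multiple-choice questions grouped by compositional aspect, a natural way to score a model is accuracy, both overall and per aspect. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record fields `category`, `answer`, and `prediction`, the function name `accuracy_by_category`, and the sample records are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Score multiple-choice predictions.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'answer', and 'prediction' (option letters such as 'A').
    Returns (per_category_accuracy, overall_accuracy).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Compare option letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[category] += 1

    per_category = {c: correct[c] / total[c] for c in total}
    n_total = sum(total.values())
    overall = sum(correct.values()) / n_total if n_total else 0.0
    return per_category, overall


# Illustrative records only; these are not taken from the benchmark.
sample = [
    {"category": "camera_movement", "answer": "A", "prediction": "a"},
    {"category": "camera_movement", "answer": "C", "prediction": "B"},
    {"category": "shot_size", "answer": "D", "prediction": " D "},
]
per_category, overall = accuracy_by_category(sample)
```

Reporting accuracy per category as well as overall makes it easier to see which compositional aspects a model handles poorly, which a single aggregate score would hide.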

Summary (original)

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.

arXiv information

Authors	Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu
Published	2024-11-19 17:46:27+00:00

Source and services used

arxiv.jp, Google
