MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

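Abstract

We introduce MMVU, a comprehensive, expert-level, multi-discipline benchmark for evaluating foundation models on video understanding.
MMVU contains 3,000 expert-annotated questions covering 27 subjects in four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering.
Compared with prior benchmarks, MMVU makes three key advances.
First, it goes beyond the basic visual perception that current video benchmarks typically assess, challenging models to apply domain-specific knowledge and perform expert-level reasoning in order to analyze specialized-domain videos.
Second, every example is annotated from scratch by human experts.
Strict data quality controls are applied to ensure the high quality of the dataset.
Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, which facilitates in-depth analysis.
We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU.
The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models.
However, they still fall short of human expertise.
Through in-depth error analyses and case studies, we offer actionable insights for future advances in expert-level, knowledge-intensive video understanding for specialized domains.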

Abstract (original)

We introduce MMVU, a comprehensive expert-level, multi-discipline benchmark for evaluating foundation models in video understanding. MMVU includes 3,000 expert-annotated questions spanning 27 subjects across four core disciplines: Science, Healthcare, Humanities & Social Sciences, and Engineering. Compared to prior benchmarks, MMVU features three key advancements. First, it challenges models to apply domain-specific knowledge and perform expert-level reasoning to analyze specialized-domain videos, moving beyond the basic visual perception typically assessed in current video benchmarks. Second, each example is annotated by human experts from scratch. We implement strict data quality controls to ensure the high quality of the dataset. Finally, each example is enriched with expert-annotated reasoning rationales and relevant domain knowledge, facilitating in-depth analysis. We conduct an extensive evaluation of 32 frontier multimodal foundation models on MMVU. The latest System-2-capable models, o1 and Gemini 2.0 Flash Thinking, achieve the highest performance among the tested models. However, they still fall short of matching human expertise. Through in-depth error analyses and case studies, we offer actionable insights for future advancements in expert-level, knowledge-intensive video understanding for specialized domains.

arXiv information

Authors	Yilun Zhao, Lujing Xie, Haowei Zhang, Guo Gan, Yitao Long, Zhiyuan Hu, Tongyan Hu, Weiyuan Chen, Chuhan Li, Junyang Song, Zhijian Xu, Chengye Wang, Weifeng Pan, Ziyao Shangguan, Xiangru Tang, Zhenwen Liang, Yixin Liu, Chen Zhao, Arman Cohan
Published	2025-01-21 18:56:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
