MINERVA: Evaluating Complex Video Reasoning

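Summary

Multimodal LLMs are turning their focus to video benchmarks, but most video benchmarks provide only outcome supervision, with no intermediate or interpretable reasoning steps.
This makes it hard to assess whether a model truly combines perceptual and temporal information to reason about a video, or merely reaches the correct answer by chance or by exploiting linguistic biases.
To remedy this, we provide MINERVA, a new video reasoning dataset for modern multimodal models.
Each question in the dataset comes with five answer choices, as well as detailed, hand-crafted reasoning traces.
The dataset is multimodal, diverse in video domain and length, and consists of complex multi-step questions.
Extensive benchmarking shows that the dataset is challenging for frontier open-source and proprietary models.
To identify failure modes common across models, we perform a fine-grained error analysis and create a taxonomy of reasoning errors.
We use this taxonomy to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failures stem primarily from temporal localization, followed by visual perception errors, rather than from logical or completeness errors.
The dataset, along with questions, answer candidates, and reasoning traces, will be publicly available at https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.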

Summary (original)

Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks only provide outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challenging to assess if models are truly able to combine perceptual and temporal information to reason about videos, or simply get the correct answer by chance or by exploiting linguistic biases. To remedy this, we provide a new video reasoning dataset called MINERVA for modern multimodal models. Each question in the dataset comes with 5 answer choices, as well as detailed, hand-crafted reasoning traces. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. Extensive benchmarking shows that our dataset provides a challenge for frontier open-source and proprietary models. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors. We use this to explore both human and LLM-as-a-judge methods for scoring video reasoning traces, and find that failure modes are primarily related to temporal localization, followed by visual perception errors, as opposed to logical or completeness errors. The dataset, along with questions, answer candidates and reasoning traces, will be publicly available under https://github.com/google-deepmind/neptune?tab=readme-ov-file#minerva.

arXiv information

Authors	Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, Cordelia Schmid, Tobias Weyand
Published	2025-05-01 17:41:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
