A Shortcut-aware Video-QA Benchmark for Physical Understanding via Minimal Video Pairs

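Summary

Existing benchmarks for evaluating the spatio-temporal understanding and reasoning abilities of video language models are prone to score inflation, because shortcut solutions based on superficial visual or textual cues exist.
This paper mitigates the difficulty of accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for evaluating the physical understanding of video language models.
The benchmark consists of 55K high-quality multiple-choice video QA examples focused on understanding the physical world.
The examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and intuitive physics benchmarks from cognitive science.
To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair.
To answer a question correctly, a model must give the correct answer for both examples in the minimal-change pair.
As a result, models that rely solely on visual or textual biases score below random performance.
Human performance on MVP is 92.9%, while the best open-source state-of-the-art video language model reaches 40.2%, compared with random performance of 25%.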
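The paired scoring rule described above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the `(pair_id, predicted, gold)` record layout, and the two-option answers `"A"`/`"B"` are all assumptions made for illustration. It shows why a model that ignores the video and always gives the same answer scores zero on pairs whose gold answers are opposite, even though its per-example accuracy looks like chance.

```python
from collections import defaultdict


def single_accuracy(records):
    """Fraction of individual examples answered correctly."""
    records = list(records)
    if not records:
        return 0.0
    return sum(pred == gold for _, pred, gold in records) / len(records)


def paired_accuracy(records):
    """Fraction of minimal-change pairs where BOTH examples are correct.

    records: iterable of (pair_id, predicted, gold) tuples (hypothetical layout).
    """
    pairs = defaultdict(list)
    for pair_id, pred, gold in records:
        pairs[pair_id].append(pred == gold)
    if not pairs:
        return 0.0
    # A pair counts only if it has both members and both are correct.
    solved = sum(1 for hits in pairs.values() if len(hits) == 2 and all(hits))
    return solved / len(pairs)


# Hypothetical data: each pair shares a question but has opposing gold answers.
gold = [("p1", "A"), ("p1", "B"), ("p2", "B"), ("p2", "A")]

# A bias-only model that always answers "A", regardless of the video.
biased = [(pid, "A", g) for pid, g in gold]

# A model that answers every example correctly.
perfect = [(pid, g, g) for pid, g in gold]

print(single_accuracy(biased), paired_accuracy(biased))
print(single_accuracy(perfect), paired_accuracy(perfect))
```

If each question had two answer options (an assumption; the abstract only says multiple-choice), independent guessing would get both examples of a pair right with probability 0.5 × 0.5 = 0.25, which is consistent with the stated 25% random baseline.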

Abstract (original)

Existing benchmarks for assessing the spatio-temporal understanding and reasoning abilities of video language models are susceptible to score inflation due to the presence of shortcut solutions based on superficial visual or textual cues. This paper mitigates the challenges in accurately assessing model performance by introducing the Minimal Video Pairs (MVP) benchmark, a simple shortcut-aware video QA benchmark for assessing the physical understanding of video language models. The benchmark is comprised of 55K high-quality multiple-choice video QA examples focusing on physical world understanding. Examples are curated from nine video data sources, spanning first-person egocentric and exocentric videos, robotic interaction data, and cognitive science intuitive physics benchmarks. To mitigate shortcut solutions that rely on superficial visual or textual cues and biases, each sample in MVP has a minimal-change pair: a visually similar video accompanied by an identical question but an opposing answer. To answer a question correctly, a model must provide correct answers for both examples in the minimal-change pair; as such, models that solely rely on visual or textual biases would achieve below random performance. Human performance on MVP is 92.9%, while the best open-source state-of-the-art video-language model achieves 40.2% compared to random performance at 25%.

arXiv information

Authors	Benno Krojer,Mojtaba Komeili,Candace Ross,Quentin Garrido,Koustuv Sinha,Nicolas Ballas,Mahmoud Assran
Published	2025-06-11 17:57:32+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
