TVBench: Redesigning Video-Language Evaluation

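Summary

Large language models, when integrated with vision models, have shown impressive performance that even enables video understanding. However, evaluating these video models poses unique challenges, and several benchmarks have been proposed. In this paper, we show that the most widely used video-language benchmarks can be solved without much temporal reasoning. Specifically: (i) static information from a single frame is often sufficient to solve the task; (ii) the text of the questions and candidate answers is overly informative, so models can answer correctly without relying on visual input; (iii) world knowledge alone can answer many of the questions, which turns the benchmarks into tests of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, and that automatic evaluation with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a new open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluation that it requires a high level of temporal understanding. Surprisingly, most recent state-of-the-art video-language models perform close to random chance on TVBench, with only a few models, such as Qwen2-VL and Tarsier, clearly surpassing this baseline.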

Abstract (original)

Large language models have demonstrated impressive performance when integrated with vision models, even enabling video understanding. However, evaluating these video models presents its own unique challenges, for which several benchmarks have been proposed. In this paper, we show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning. We identified three main issues in existing datasets: (i) static information from single frames is often sufficient to solve the tasks; (ii) the text of the questions and candidate answers is overly informative, allowing models to answer correctly without relying on any visual input; (iii) world knowledge alone can answer many of the questions, making the benchmarks a test of knowledge replication rather than visual reasoning. In addition, we found that open-ended question-answering benchmarks for video understanding suffer from similar issues, while the automatic evaluation process with LLMs is unreliable, making it an unsuitable alternative. As a solution, we propose TVBench, a novel open-source video multiple-choice question-answering benchmark, and demonstrate through extensive evaluations that it requires a high level of temporal understanding. Surprisingly, we find that most recent state-of-the-art video-language models perform close to random chance on TVBench, with only a few models, such as Qwen2-VL and Tarsier, clearly surpassing this baseline.

arXiv information

Authors	Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees G. M. Snoek, Yuki M. Asano
Published	2025-01-03 11:21:25+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
