MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

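Summary

In recent years, vision language models (VLMs) have significantly advanced video understanding.
However, fine-grained motion comprehension, a crucial capability, remains under-explored in current benchmarks.
To address this gap, we propose MotionBench, a comprehensive benchmark designed to evaluate the fine-grained motion comprehension of video understanding models.
MotionBench assesses models' motion-level perception through six primary categories of motion-oriented question types, and it includes data collected from diverse sources to ensure broad coverage of real-world video content.
Experimental results show that existing VLMs perform poorly at understanding fine-grained motion.
To strengthen the ability of VLMs to perceive fine-grained motion within the limited sequence length of an LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression, and we propose a novel and efficient Through-Encoder (TE) Fusion method.
The experiments show that higher-frame-rate inputs and TE Fusion improve motion understanding, although substantial room for improvement remains.
Our benchmark aims to guide and motivate the development of more capable video understanding models by emphasizing the importance of fine-grained motion comprehension.
Project page: https://motion-bench.github.io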

Abstract (original)

In recent years, vision language models (VLMs) have made significant advancements in video understanding. However, a crucial capability – fine-grained motion comprehension – remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models’ motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly in understanding fine-grained motions. To enhance VLM’s ability to perceive fine-grained motion within a limited sequence length of LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion yield improvements in motion understanding, yet there is still substantial room for enhancement. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io .

arXiv information

Authors	Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
Published	2025-01-06 11:57:38+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
