AV-Reasoner: Improving and Benchmarking Clue-Grounded Audio-Visual Counting for MLLMs

Abstract

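Despite advances in video understanding, current MLLMs still struggle with counting tasks.
Existing benchmarks are limited by short videos, closed-set queries, missing clue annotations, and weak multimodal coverage.
This paper introduces CG-AV-Counting, a manually annotated, clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues across 497 long videos.
It supports both black-box and white-box evaluation and serves as a comprehensive testbed for both end-to-end and reasoning-based counting.
To explore how a model's counting capability can be improved, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks.
AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning.
However, experiments show that on out-of-domain benchmarks, reasoning in the language space does not yield performance gains.
The code and benchmark are available at https://av-reasoner.github.io.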

Abstract (original)

Despite progress in video understanding, current MLLMs struggle with counting tasks. Existing benchmarks are limited by short videos, close-set queries, lack of clue annotations, and weak multimodal coverage. In this paper, we introduce CG-AV-Counting, a manually-annotated clue-grounded counting benchmark with 1,027 multimodal questions and 5,845 annotated clues over 497 long videos. It supports both black-box and white-box evaluation, serving as a comprehensive testbed for both end-to-end and reasoning-based counting. To explore ways to improve a model’s counting capability, we propose AV-Reasoner, a model trained with GRPO and curriculum learning to generalize counting ability from related tasks. AV-Reasoner achieves state-of-the-art results across multiple benchmarks, demonstrating the effectiveness of reinforcement learning. However, experiments show that on out-of-domain benchmarks, reasoning in the language space fails to bring performance gains. The code and benchmark have been released at https://av-reasoner.github.io.

arXiv information

Authors	Lidong Lu, Guo Chen, Zhiqi Li, Yicheng Liu, Tong Lu
Published	2025-06-05 17:58:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

