VideoDeepResearch: Long Video Understanding With Agentic Tool Using

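Summary

Long video understanding (LVU) poses a significant challenge for current multimodal large language models (MLLMs) because of the inherent complexity of the task and context window constraints.
It is widely assumed that addressing LVU tasks requires a foundation MLLM with an extended context window, strong visual perception capabilities, and proficient domain expertise.
In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding.
Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multimodal toolkit that includes multimodal retrievers and visual perceivers.
For each LVU task, the system formulates a problem-solving strategy through reasoning and uses tools to selectively access and utilize essential video content.
We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench.
Our results show that VideoDeepResearch improves substantially over existing MLLM baselines, surpassing the previous state of the art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively.
These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.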

Summary (original)

Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.
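The loop described in the abstract (a text-only reasoner that plans, then calls a retriever and a perceiver to look at only the relevant parts of the video) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in and not the authors' implementation: `Clip`, `retrieve`, `perceive`, `toy_reasoner`, and `run_agent` are invented names, captions stand in for visual content, word overlap stands in for a multimodal retriever, and a fixed rule stands in for the LRM.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    start: int    # clip start time in seconds
    end: int      # clip end time in seconds
    caption: str  # assumption: a text caption stands in for the clip's visual content


def retrieve(clips, query, top_k=2):
    """Toy retriever (assumption): rank clips by word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        clips,
        key=lambda c: len(q & set(c.caption.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def perceive(clip, question):
    """Toy perceiver (assumption): turn a clip into a textual observation."""
    return f"[{clip.start}-{clip.end}s] {clip.caption}"


def toy_reasoner(question, observations):
    """Stand-in for the text-only reasoning model: choose the next action."""
    if not observations:
        # Nothing seen yet, so ask the retriever for relevant clips.
        return ("retrieve", question)
    # Enough evidence gathered, so answer from the textual observations only.
    return ("answer", " | ".join(observations))


def run_agent(question, clips, max_steps=4):
    """Alternate between reasoning and tool calls until an answer is produced."""
    observations = []
    for _ in range(max_steps):
        action, arg = toy_reasoner(question, observations)
        if action == "retrieve":
            for clip in retrieve(clips, arg):
                observations.append(perceive(clip, question))
        elif action == "answer":
            return arg
    return "no answer within step budget"


if __name__ == "__main__":
    video = [
        Clip(0, 10, "a chef chops onions"),
        Clip(10, 20, "the striker scores a goal"),
        Clip(20, 30, "the crowd celebrates the goal"),
    ]
    print(run_agent("who scores the goal", video))
```

The point of the sketch is the division of labor: the reasoner only ever sees text, and the video is touched solely through tool calls on a small set of retrieved clips, so the full video never has to fit in a context window.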

arXiv information

Authors	Huaying Yuan,Zheng Liu,Junjie Zhou,Ji-Rong Wen,Zhicheng Dou
Published	2025-06-12 15:39:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

