VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

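Summary

Videos, with their unique temporal dimension, demand precise, grounded understanding in which answers are directly linked to visual, interpretable evidence.
Despite major breakthroughs in the reasoning capabilities of large language models, multimodal reasoning, especially for video, remains unexplored.
This work introduces VideoMind, a new video-language agent designed for temporally grounded video understanding.
VideoMind incorporates two key innovations. (i) It identifies the capabilities essential for temporal reasoning over video and develops a role-based agentic workflow comprising a planner that coordinates the different roles, a grounder for temporal localization, a verifier that assesses the accuracy of temporal intervals, and an answerer for question answering.
(ii) To integrate these diverse roles efficiently, it proposes a Chain-of-LoRA strategy that enables seamless role switching via lightweight LoRA adapters while avoiding the overhead of multiple models, balancing efficiency and flexibility.
Extensive experiments on 14 public benchmarks show that the agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question answering, 6 on video temporal grounding, and 5 on general video question answering, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.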

Summary (original)

Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning – especially for videos – remains unexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, including 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, underscoring its effectiveness in advancing video agent and long-form temporal reasoning.
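
The role-based workflow and adapter switching described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not specify the plan format, how candidate intervals are scored, or any API, so `SharedBackbone`, `answer_question`, and the toy role functions below are all hypothetical stand-ins. The only idea carried over from the abstract is that a single shared model serves the planner, grounder, verifier, and answerer roles by switching a lightweight per-role adapter rather than loading separate models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class SharedBackbone:
    """Toy stand-in for one base model with several named role adapters."""
    adapters: Dict[str, Callable]
    active: str = ""
    switches: List[str] = field(default_factory=list)

    def run(self, role: str, *args):
        # Changing role only changes which adapter is active;
        # the backbone object itself is never duplicated.
        if role != self.active:
            self.active = role
            self.switches.append(role)
        return self.adapters[role](*args)


def answer_question(model: SharedBackbone, question: str):
    """Planner -> grounder -> verifier -> answerer, all on one backbone."""
    plan = model.run("planner", question)
    span: Optional[Span] = None
    if "grounder" in plan:
        candidates = model.run("grounder", question)
        if "verifier" in plan:
            # Keep the candidate interval that the verifier scores highest.
            span = max(candidates, key=lambda s: model.run("verifier", question, s))
        else:
            span = candidates[0]
    return model.run("answerer", question, span), span


# Toy role behaviours (assumptions for illustration only).
def toy_planner(question: str) -> List[str]:
    return ["grounder", "verifier", "answerer"]


def toy_grounder(question: str) -> List[Span]:
    return [(0.0, 5.0), (12.0, 20.0), (40.0, 42.0)]


def toy_verifier(question: str, span: Span) -> float:
    return span[1] - span[0]  # arbitrary toy score: prefer longer intervals


def toy_answerer(question: str, span: Optional[Span]) -> str:
    if span is None:
        return "answer without grounding"
    return f"answer grounded in {span[0]:.0f}s-{span[1]:.0f}s"


def build_toy_model() -> SharedBackbone:
    return SharedBackbone(adapters={
        "planner": toy_planner,
        "grounder": toy_grounder,
        "verifier": toy_verifier,
        "answerer": toy_answerer,
    })


if __name__ == "__main__":
    model = build_toy_model()
    print(answer_question(model, "When does the person open the door?"))
    print(model.switches)
```

In this sketch the `switches` list records each role change, which makes it easy to see that one question triggers a short chain of adapter activations on a single backbone. In a real system each role function would be a forward pass of the same model with a different LoRA adapter enabled.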

arXiv information

Authors	Ye Liu,Kevin Qinghong Lin,Chang Wen Chen,Mike Zheng Shou
Published	2025-03-17 17:59:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
