HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding

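Summary

Despite advances in multimodal large language models (MLLMs), current approaches struggle with medium-to-long video understanding because of limits on frame count and context length.
As a result, these models often rely on frame sampling, which risks missing key information over time and is not tailored to the task at hand.
To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that processes frames sequentially, bypassing the need for frame sampling while avoiding the LLM's context length limitations.
We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness into video understanding: the entity stream captures frame-level object information within a short context, and the scene stream identifies their broader interactions over a longer period of time.
Each stream is supported by dedicated memory banks, which enable the proposed Hierarchical Querying transformer (HierarQ) to capture short- and long-term context effectively.
Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ's state-of-the-art performance on most datasets, showing its robustness and efficiency for comprehensive video analysis.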

Summary (original)

Despite advancements in multimodal large language models (MLLMs), current approaches struggle in medium-to-long video understanding due to frame and context length limitations. As a result, these models often depend on frame sampling, which risks missing key information over time and lacks task-specific relevance. To address these challenges, we introduce HierarQ, a task-aware hierarchical Q-Former based framework that sequentially processes frames to bypass the need for frame sampling, while avoiding LLM’s context length limitations. We introduce a lightweight two-stream language-guided feature modulator to incorporate task awareness in video understanding, with the entity stream capturing frame-level object information within a short context and the scene stream identifying their broader interactions over a longer period of time. Each stream is supported by dedicated memory banks which enable our proposed Hierarchical Querying transformer (HierarQ) to effectively capture short and long-term context. Extensive evaluations on 10 video benchmarks across video understanding, question answering, and captioning tasks demonstrate HierarQ’s state-of-the-art performance across most datasets, proving its robustness and efficiency for comprehensive video analysis.
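
The abstract gives no implementation details, so the following is only a toy sketch of the general idea it describes: frames are processed one at a time, each frame is modulated by a task-dependent signal in two streams, and each stream writes into its own bounded memory. Every name here (`modulate`, `ShortTermMemory`, `LongTermMemory`, `process_video`), the element-wise scaling used as "modulation", and the merge-oldest compression rule are assumptions made for illustration, not the authors' method.

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


def modulate(frame: Vector, prompt: Vector) -> Vector:
    """Hypothetical language-guided modulation: scale each frame
    feature by the matching task-prompt weight."""
    return [f * p for f, p in zip(frame, prompt)]


class ShortTermMemory:
    """Assumed short-context memory: keeps only the most recent
    `capacity` features (first in, first out)."""

    def __init__(self, capacity: int) -> None:
        self.items: Deque[Vector] = deque(maxlen=capacity)

    def add(self, feature: Vector) -> None:
        self.items.append(feature)


class LongTermMemory:
    """Assumed long-context memory: stays within `capacity` by
    averaging the two oldest entries instead of discarding them."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.items: List[Vector] = []

    def add(self, feature: Vector) -> None:
        self.items.append(feature)
        if len(self.items) > self.capacity:
            oldest = self.items.pop(0)
            second = self.items.pop(0)
            merged = [(a + b) / 2 for a, b in zip(oldest, second)]
            self.items.insert(0, merged)


def process_video(
    frames: List[Vector],
    entity_prompt: Vector,
    scene_prompt: Vector,
    short_capacity: int = 2,
    long_capacity: int = 3,
) -> Tuple[List[Vector], List[Vector]]:
    """Process every frame in order (no sampling) while memory stays bounded."""
    entity_memory = ShortTermMemory(short_capacity)
    scene_memory = LongTermMemory(long_capacity)
    for frame in frames:
        entity_memory.add(modulate(frame, entity_prompt))
        scene_memory.add(modulate(frame, scene_prompt))
    return list(entity_memory.items), scene_memory.items


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
entity, scene = process_video(frames, entity_prompt=[1.0, 0.0], scene_prompt=[0.0, 1.0])
print(entity)
print(scene)
```

The point of the sketch is that memory size is fixed by the two capacities rather than by video length, which is the property the abstract appeals to when it says sequential processing avoids the LLM's context length limit.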

arXiv information

Authors	Shehreen Azad, Vibhav Vineet, Yogesh Singh Rawat
Published	2025-03-11 16:21:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
