AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

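Summary

Advances in large language models (LLMs) have driven progress on video understanding tasks by pairing LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are limited to processing short videos. Recent work attempts to understand long videos by extracting visual features and compressing them into a fixed-size memory. These methods, however, merge video tokens using the visual modality alone and overlook the correlation between visual and textual queries, which makes it difficult for them to handle complex question-answering tasks effectively. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which for the first time introduces an adaptive cross-modality memory reduction approach to video-text alignment, applied auto-regressively over video streams. Extensive experiments on video captioning, video question answering, and video classification show that AdaCM$^2$ achieves state-of-the-art performance on multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset while reducing GPU memory consumption by up to 65%.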

Summary (original)

The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent attempts to understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
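The abstract names the idea (reducing a video memory using cross-modality information, auto-regressively over a stream) but does not give the scoring or eviction rule. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes tokens are plain float vectors, scores each visual token by the softmax attention it receives from the text queries, and keeps the top `budget` tokens after each incoming frame. The names `reduce_memory`, `process_stream`, and `budget` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    # Subtract the max for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reduce_memory(visual_tokens: List[Vector],
                  text_queries: List[Vector],
                  budget: int) -> List[Vector]:
    """Keep the `budget` visual tokens that receive the most text attention.

    Assumption: importance = summed softmax attention weight over all
    text queries. The paper's actual criterion may differ.
    """
    if len(visual_tokens) <= budget:
        return list(visual_tokens)
    importance = [0.0] * len(visual_tokens)
    for q in text_queries:
        weights = softmax([dot(q, v) for v in visual_tokens])
        for i, w in enumerate(weights):
            importance[i] += w
    ranked = sorted(range(len(visual_tokens)),
                    key=lambda i: importance[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore temporal order
    return [visual_tokens[i] for i in keep]


def process_stream(frames: List[List[Vector]],
                   text_queries: List[Vector],
                   budget: int) -> List[Vector]:
    """Consume frames one at a time, keeping the memory within `budget`."""
    memory: List[Vector] = []
    for frame_tokens in frames:
        memory.extend(frame_tokens)
        memory = reduce_memory(memory, text_queries, budget)
    return memory


if __name__ == "__main__":
    frames = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[2.0, 0.0], [0.0, 2.0]],
        [[3.0, 0.0], [0.0, 0.5]],
    ]
    text_queries = [[1.0, 0.0]]
    print(process_stream(frames, text_queries, budget=2))
```

Because the memory never grows beyond `budget` plus one frame of tokens, its size stays bounded regardless of video length, which is the property the abstract attributes to fixed-memory approaches; the text-conditioned scoring is what makes this variant cross-modal rather than vision-only.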

arXiv information

Authors	Yuanbin Man, Ying Huang, Chengming Zhang, Bingzhe Li, Wei Niu, Miao Yin
Published	2025-04-04 17:58:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
