AdaCM$^2$: On Understanding Extremely Long-Term Video with Adaptive Cross-Modality Memory Reduction

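Summary

Advances in large language models (LLMs) have driven improvements in video understanding tasks by combining LLMs with visual models.
However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are limited to processing short videos.
Recent work tries to understand long videos by extracting visual features and compressing them into a fixed memory size.
Nevertheless, these methods merge video tokens using the visual modality alone and overlook the correlation between visual and textual queries, which makes it hard for them to handle complex question-answering tasks effectively.
To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which is the first to introduce an adaptive cross-modality memory reduction approach to video-text alignment, applied in an auto-regressive manner on video streams.
Extensive experiments on a range of video understanding tasks, including video captioning, video question answering, and video classification, show that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage.
Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset while reducing GPU memory consumption by up to 65%.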

Summary (original)

The advancements in large language models (LLMs) have propelled the improvement of video understanding tasks by incorporating LLMs with visual models. However, most existing LLM-based models (e.g., VideoLLaMA, VideoChat) are constrained to processing short-duration videos. Recent approaches attempt to understand long-term videos by extracting and compressing visual features into a fixed memory size. Nevertheless, those methods leverage only the visual modality to merge video tokens and overlook the correlation between visual and textual queries, leading to difficulties in effectively handling complex question-answering tasks. To address the challenges of long videos and complex prompts, we propose AdaCM$^2$, which, for the first time, introduces an adaptive cross-modality memory reduction approach to video-text alignment in an auto-regressive manner on video streams. Our extensive experiments on various video understanding tasks, such as video captioning, video question answering, and video classification, demonstrate that AdaCM$^2$ achieves state-of-the-art performance across multiple datasets while significantly reducing memory usage. Notably, it achieves a 4.5% improvement across multiple tasks in the LVU dataset with a GPU memory consumption reduction of up to 65%.
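
To make the idea of text-guided memory reduction on a video stream more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the scoring rule or the eviction policy, so the relevance score (the highest softmax attention any text token gives a visual token), the fixed `budget`, and all function names are assumptions chosen for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def text_relevance(visual_tokens, text_tokens):
    """Score each visual token by the largest attention weight
    that any text token assigns to it (assumed scoring rule)."""
    scale = math.sqrt(len(text_tokens[0]))
    scores = [0.0] * len(visual_tokens)
    for t in text_tokens:
        attn = softmax([dot(t, v) / scale for v in visual_tokens])
        scores = [max(s, a) for s, a in zip(scores, attn)]
    return scores


def reduce_memory(memory, text_tokens, budget):
    """Keep at most `budget` tokens, preferring text-relevant ones
    and preserving their temporal order."""
    if len(memory) <= budget:
        return memory
    scores = text_relevance(memory, text_tokens)
    ranked = sorted(range(len(memory)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])
    return [memory[i] for i in keep]


def stream(frames, text_tokens, budget):
    """Process frames one at a time; memory never exceeds `budget`
    tokens after each step (hypothetical streaming loop)."""
    memory = []
    for frame_tokens in frames:
        memory = reduce_memory(memory + frame_tokens, text_tokens, budget)
    return memory


if __name__ == "__main__":
    text = [[1.0, 0.0]]
    frames = [
        [[5.0, 0.0], [0.0, 5.0]],
        [[4.0, 0.0], [0.0, 1.0]],
    ]
    print(stream(frames, text, budget=2))
```

In this toy run the text token points along the first axis, so the two visual tokens aligned with it survive and the others are evicted. A purely visual merging scheme has no access to that signal, which is the gap the abstract describes.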

arXiv information

Authors	Yuanbin Man, Ying Huang, Chengming Zhang, Bingzhe Li, Wei Niu, Miao Yin
Published	2024-11-19 18:04:13+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
