Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

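Abstract

State-of-the-art transformer-based large multimodal models (LMMs) struggle to process hour-long video inputs because causal self-attention has quadratic complexity, which leads to high computational costs during training and inference.
Existing token-compression methods reduce the number of video tokens, but they often lose information and remain inefficient for extremely long sequences.
This paper explores an orthogonal direction: a hybrid Mamba-Transformer model (Vamba) that uses Mamba-2 blocks to encode video tokens with linear complexity.
Without any token reduction, Vamba can encode more than 1024 frames (640$\times$360) on a single GPU, whereas transformer-based models can encode only 256 frames.
On long video inputs, Vamba reduces GPU memory usage by at least 50% during training and inference and nearly doubles the speed per training step compared with transformer-based LMMs.
The experimental results show that Vamba improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, while maintaining strong performance across a broad range of long and short video understanding tasks.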
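
The gap between quadratic and linear complexity can be made concrete with a rough count of work units. The sketch below is only an illustration and does not model Vamba's actual architecture or its measured numbers: `TOKENS_PER_FRAME` is a hypothetical value that does not come from the paper, the function names are made up for this example, and constant factors such as hidden size and layer count are ignored. It compares how the cost grows when the input goes from 256 to 1024 frames, the two frame counts quoted in the abstract.

```python
# Rough cost model: count "work units" as a function of sequence length.
# Assumption: TOKENS_PER_FRAME is hypothetical and not taken from the paper.
TOKENS_PER_FRAME = 100


def causal_attention_pairs(num_tokens: int) -> int:
    """Query-key pairs scored by causal self-attention.

    Each token attends to itself and every earlier token,
    so the total is n * (n + 1) / 2, which grows quadratically.
    """
    return num_tokens * (num_tokens + 1) // 2


def linear_scan_steps(num_tokens: int) -> int:
    """State updates in a linear-time recurrent scan: one per token."""
    return num_tokens


def growth(cost_fn, frames_small: int, frames_large: int,
           tokens_per_frame: int = TOKENS_PER_FRAME) -> float:
    """Ratio of the cost at frames_large to the cost at frames_small."""
    small = cost_fn(frames_small * tokens_per_frame)
    large = cost_fn(frames_large * tokens_per_frame)
    return large / small


if __name__ == "__main__":
    for name, fn in [("causal self-attention", causal_attention_pairs),
                     ("linear scan", linear_scan_steps)]:
        print(f"{name}: cost grows {growth(fn, 256, 1024):.2f}x "
              f"from 256 to 1024 frames")
```

Under this simplified model, a 4x longer input costs about 16x more for causal self-attention and 4x more for a linear-time scan, whatever value is chosen for the tokens per frame.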

Abstract (original)

State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.

arXiv information

Authors	Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen
Published	2025-03-14 16:45:23+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
