MambaMia: A State-Space-Model-Based Compression for Efficient Video Understanding in Large Multimodal Models

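Abstract

We propose an efficient framework that compresses the features of multiple video frames before they are fed into a large multimodal model, mitigating the severe token explosion caused by long or dense videos.
Our design uses a bidirectional state-space-based block with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries.
This structure enables hierarchical downsampling along both the spatial and temporal dimensions while preserving performance cost-effectively.
Across challenging long and dense video understanding tasks, our approach achieves results competitive with state-of-the-art models while significantly reducing the overall token budget.
Notably, replacing the proposed state-space block with a conventional Transformer causes substantial performance degradation, which highlights the advantage of state-space modeling for effectively compressing multi-frame video data.
Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployment.
We validate its scalability and generality across multiple benchmarks, achieving the dual goals of efficient resource usage and comprehensive video understanding.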

Abstract (original)

We propose an efficient framework to compress multiple video-frame features before feeding them into large multimodal models, thereby mitigating the severe token explosion arising from long or dense videos. Our design leverages a bidirectional state-space-based block equipped with a gated skip connection and a learnable weighted-average pooling mechanism applied to periodically inserted learned queries. This structure enables hierarchical downsampling across both spatial and temporal dimensions, preserving performance in a cost-effective manner. Across challenging long and dense video understanding tasks, our approach demonstrates competitive results against state-of-the-art models, while significantly reducing overall token budget. Notably, replacing our proposed state-space block with a conventional Transformer results in substantial performance degradation, highlighting the advantages of state-space modeling for effectively compressing multi-frame video data. Our framework emphasizes resource-conscious efficiency, making it practical for real-world deployments. We validate its scalability and generality across multiple benchmarks, achieving the dual objectives of efficient resource usage and comprehensive video understanding.
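
The abstract gives no implementation details, so the following is only a minimal sketch of one possible reading of "weighted-average pooling applied to periodically inserted learned queries" with a gated skip connection. Everything here is an assumption made for illustration: the names `pool_chunk` and `compress`, the scalar `gate`, the fixed `period`, and the softmax-normalized `logits` are hypothetical, and the bidirectional state-space mixing itself is omitted entirely. The sketch only shows how keeping one query per chunk reduces the token count by a factor of `period`.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool_chunk(chunk: List[Vector], logits: List[float]) -> Vector:
    """Weighted average of the tokens in one chunk.

    `logits` stands in for learnable pooling parameters (assumption);
    they are normalized with softmax so the weights sum to 1.
    """
    weights = softmax(logits[: len(chunk)])
    dim = len(chunk[0])
    return [sum(w * tok[d] for w, tok in zip(weights, chunk)) for d in range(dim)]


def compress(
    tokens: List[Vector],
    query: Vector,
    logits: List[float],
    gate: float,
    period: int,
) -> List[Vector]:
    """Emit one output token per `period` input tokens.

    Each output mixes the pooled chunk with a learned query vector through
    a scalar gate (a hypothetical stand-in for a gated skip connection).
    """
    if period <= 0 or len(logits) != period:
        raise ValueError("period must be positive and match len(logits)")
    compressed = []
    for start in range(0, len(tokens), period):
        chunk = tokens[start : start + period]
        pooled = pool_chunk(chunk, logits)
        compressed.append(
            [gate * p + (1.0 - gate) * q for p, q in zip(pooled, query)]
        )
    return compressed


# Toy example: 6 frame tokens of dimension 2, one query every 3 tokens.
frame_tokens = [[float(i), float(2 * i)] for i in range(6)]
summary = compress(
    frame_tokens, query=[0.0, 0.0], logits=[0.0, 0.0, 0.0], gate=1.0, period=3
)
print(len(frame_tokens), "->", len(summary))  # prints "6 -> 2"
```

With all-zero logits the pooling reduces to a plain mean over each chunk, and with `gate=1.0` the query contributes nothing; in a trained model both would be learned, and the pooled tokens would first pass through the state-space block that this sketch leaves out.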

arXiv information

Authors	Geewook Kim, Minjoon Seo
Published	2025-06-16 14:49:49+00:00

Source and services used

arxiv.jp, Google
