BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

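Abstract

Video question answering (VQA) on long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames.
The self-attention mechanism offers a general solution for sequence modeling, but its cost is prohibitive when applied to the massive number of spatiotemporal tokens in long videos.
Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling.
However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns.
In this work, we introduce BIMBA, an efficient state-space model for handling long-form videos.
Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing.
Extensive experiments show that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME.
Code and models are publicly available at https://sites.google.com/view/bimba-mllm.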

Abstract (original)

Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code and models are publicly available at https://sites.google.com/view/bimba-mllm.
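
To make the idea of a selective scan more concrete, here is a minimal sketch of a generic, single-channel selective state-space recurrence, followed by a simple reduction to fewer outputs. It is not BIMBA's architecture, which the abstract does not specify. The function names (`selective_scan`, `compress`), the projection weights (`w_delta`, `w_b`, `w_c`), and the choice to keep one output per chunk are all assumptions made for illustration.

```python
import math


def softplus(z):
    """Smooth positive function, used here to keep the step size > 0."""
    return math.log1p(math.exp(z))


def selective_scan(x, a, w_delta, w_b, w_c):
    """Run a toy single-channel selective state-space recurrence.

    x       : list of floats, the input sequence (length T)
    a       : list of negative floats, diagonal state matrix (length N)
    w_delta : float, hypothetical weight producing the step size
    w_b, w_c: lists of floats (length N), hypothetical weights that make
              the input and output projections depend on the current input
    """
    n = len(a)
    h = [0.0] * n  # hidden state, starts at zero
    ys = []
    for x_t in x:
        # "Selective": step size and projections are functions of x_t.
        delta = softplus(w_delta * x_t)
        b_t = [w * x_t for w in w_b]
        c_t = [w * x_t for w in w_c]
        # Discretized update: decay the old state, then write the new input.
        h = [
            math.exp(delta * a[i]) * h[i] + delta * b_t[i] * x_t
            for i in range(n)
        ]
        # Read out a scalar from the state.
        ys.append(sum(c_t[i] * h[i] for i in range(n)))
    return ys


def compress(ys, num_out):
    """Keep the last output of each of num_out equal chunks (an assumption).

    Each kept output comes from a state that has already scanned every
    earlier input, so it can summarize its chunk and what came before it.
    """
    if num_out <= 0 or num_out > len(ys):
        raise ValueError("num_out must be between 1 and len(ys)")
    stride = len(ys) // num_out
    return [ys[(k + 1) * stride - 1] for k in range(num_out)]


# Usage: scan 8 scalar "tokens" and keep 4 summary outputs.
tokens = [0.1, 0.0, 0.0, 0.9, 0.0, 0.0, 0.2, 0.0]
outputs = selective_scan(
    tokens, a=[-1.0, -0.5], w_delta=1.0, w_b=[1.0, 0.5], w_c=[1.0, 1.0]
)
reduced = compress(outputs, num_out=4)
print(len(tokens), "->", len(reduced))
```

Because the step size and projections depend on each input, the recurrence can retain some inputs strongly and let others decay, which is the intuition behind selecting critical information during a scan. The cost grows linearly with the sequence length, in contrast to the quadratic cost of self-attention.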

arXiv information

Authors	Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani
Published	2025-03-12 17:57:32+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
