SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Abstract

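We propose SlowFast-LLaVA (SF-LLaVA for short), a training-free video large language model (LLM) that jointly captures detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs.
It does so by applying a two-stream SlowFast design to the inputs of a Video LLM, which aggregates features from sampled video frames effectively.
Specifically, the Slow pathway extracts features at a low frame rate while preserving as much spatial detail as possible (e.g., 24×24 tokens), whereas the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., 6x downsampling) to focus on motion cues.
As a result, the design adequately captures both the spatial and the temporal features that help in understanding details throughout the video.
Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks.
On some benchmarks, it achieves performance comparable to or better than state-of-the-art Video LLMs fine-tuned on video datasets.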
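
To make the token-budget argument concrete, the following is a minimal sketch of how the visual token count of a two-stream input could be tallied. Only the 24×24 token grid and the 6x pooling stride come from the abstract; the frame count, the Slow-pathway sampling interval, and the function names `pooled_side` and `slowfast_token_count` are hypothetical values chosen for illustration and are not taken from the paper.

```python
def pooled_side(side: int, stride: int) -> int:
    """Side length of a square token grid after non-overlapping pooling."""
    if side % stride != 0:
        raise ValueError("stride must divide the grid side evenly")
    return side // stride


def slowfast_token_count(
    num_frames: int,
    slow_every: int,
    grid_side: int = 24,
    fast_stride: int = 6,
) -> dict:
    """Count visual tokens for a two-stream (Slow + Fast) input.

    num_frames and slow_every are hypothetical illustration values;
    grid_side=24 and fast_stride=6 follow the examples in the abstract.
    """
    # Slow pathway: every `slow_every`-th frame, full-resolution token grid.
    slow_frames = len(range(0, num_frames, slow_every))
    slow_tokens = slow_frames * grid_side * grid_side

    # Fast pathway: every sampled frame, spatially pooled token grid.
    fast_side = pooled_side(grid_side, fast_stride)
    fast_tokens = num_frames * fast_side * fast_side

    return {
        "slow": slow_tokens,
        "fast": fast_tokens,
        "total": slow_tokens + fast_tokens,
    }


# Hypothetical setting: 48 sampled frames, Slow pathway keeps every 8th frame.
counts = slowfast_token_count(num_frames=48, slow_every=8)
print(counts)  # {'slow': 3456, 'fast': 768, 'total': 4224}
```

Under these assumed numbers, feeding all 48 frames at full 24×24 resolution would cost 48 × 576 = 27,648 tokens, whereas the two-stream layout needs 4,224, which illustrates why the design can cover more frames within a fixed token budget.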

Abstract (original)

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture the detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24×24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.

arXiv information

Authors	Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan
Published	2024-07-22 17:58:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
