xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

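Summary

We introduce xGen-MM-Vid (BLIP-3-Video), a multimodal language model for video that is designed specifically to capture temporal information across multiple frames efficiently.
In addition to a conventional visual tokenizer, BLIP-3-Video uses a "temporal encoder" that maps the sequence of tokens from multiple frames into a compact set of visual tokens.
This lets BLIP-3-Video use far fewer visual tokens than competing models (e.g., 32 vs. 4608 tokens).
We explore several types of temporal encoder, including learnable spatio-temporal pooling and sequential models such as Token Turing Machines.
Experiments confirm that BLIP-3-Video achieves video question-answering accuracy comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (4B) and more efficient because it uses fewer visual tokens.
The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html.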
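
To make the idea of mapping many per-frame tokens to a fixed set of 32 visual tokens concrete, here is a minimal sketch of attention-based pooling in plain Python. It is an illustration under stated assumptions, not the BLIP-3-Video implementation: the function names, the frame count, the tokens per frame, the embedding size, and the random "query" vectors (which a real model would learn) are all hypothetical. Only the output size of 32 comes from the summary.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_tokens, queries):
    """Pool all per-frame tokens into len(queries) output tokens.

    frame_tokens: list of frames, each a list of token vectors.
    queries: list of query vectors (random here; learned in a real model).
    """
    # Flatten the (frames x tokens-per-frame) grid into one sequence.
    tokens = [tok for frame in frame_tokens for tok in frame]
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        # Scaled dot-product score between this query and every token.
        scores = [scale * sum(qi * ti for qi, ti in zip(q, tok)) for tok in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all input tokens.
        pooled.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def random_vectors(rng, count, dim):
    """Helper for the demo: `count` random vectors of length `dim`."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Hypothetical sizes chosen only to keep the demo small.
rng = random.Random(0)
NUM_FRAMES, TOKENS_PER_FRAME, DIM, NUM_OUT = 8, 16, 4, 32
video = [random_vectors(rng, TOKENS_PER_FRAME, DIM) for _ in range(NUM_FRAMES)]
queries = random_vectors(rng, NUM_OUT, DIM)

pooled = attention_pool(video, queries)
print(len(pooled), len(pooled[0]))  # prints: 32 4
```

In this sketch, the number of output tokens depends only on the number of query vectors, so the language model would receive 32 visual tokens regardless of how many frames are sampled.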

Summary (original)

We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames. BLIP-3-Video takes advantage of the ‘temporal encoder’ in addition to the conventional visual tokenizer, which maps a sequence of tokens over multiple frames into a compact set of visual tokens. This enables BLIP3-Video to use much fewer visual tokens than its competing models (e.g., 32 vs. 4608 tokens). We explore different types of temporal encoders, including learnable spatio-temporal pooling as well as sequential models like Token Turing Machines. We experimentally confirm that BLIP-3-Video obtains video question-answering accuracies comparable to much larger state-of-the-art models (e.g., 34B), while being much smaller (i.e., 4B) and more efficient by using fewer visual tokens. The project website is at https://www.salesforceairesearch.com/opensource/xGen-MM-Vid/index.html

arXiv information

Authors	Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Published	2024-10-21 17:59:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
