Mobile-VideoGPT: Fast and Accurate Video Understanding Language Model

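Summary

Video understanding models are often inefficient in practice because of their high computational requirements, large parameter counts, and slow inference speed.
To address these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters.
Unlike conventional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput.
To further improve efficiency, we present an attention-based frame scoring mechanism for selecting key frames, along with an efficient token projector that prunes redundant visual tokens while preserving essential contextual cues.
We evaluate the model on six established video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest).
Our results show that Mobile-VideoGPT-0.5B generates up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average, with 40% fewer parameters and more than 2x higher throughput.
Our code and models are publicly available at https://github.com/Amshaker/Mobile-VideoGPT.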

Summary (original)

Video understanding models often struggle with high computational requirements, extensive parameter counts, and slow inference speed, making them inefficient for practical use. To tackle these challenges, we propose Mobile-VideoGPT, an efficient multimodal framework designed to operate with fewer than a billion parameters. Unlike traditional video large multimodal models (LMMs), Mobile-VideoGPT consists of lightweight dual visual encoders, efficient projectors, and a small language model (SLM), enabling real-time throughput. To further improve efficiency, we present an Attention-Based Frame Scoring mechanism to select the key-frames, along with an efficient token projector that prunes redundant visual tokens and preserves essential contextual cues. We evaluate our model across well-established six video understanding benchmarks (e.g., MVBench, EgoSchema, NextQA, and PercepTest). Our results show that Mobile-VideoGPT-0.5B can generate up to 46 tokens per second while outperforming existing state-of-the-art 0.5B-parameter models by 6 points on average with 40% fewer parameters and more than 2x higher throughput. Our code and models are publicly available at: https://github.com/Amshaker/Mobile-VideoGPT.
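The abstract names an attention-based frame scoring step but does not describe its internals. As a rough illustration of the general idea only, the sketch below scores each frame by the average scaled dot-product attention it receives from all frames and keeps the top-k frames in temporal order. The function names (`score_frames`, `select_key_frames`), the use of plain per-frame feature vectors, and the "attention received" scoring rule are all assumptions made for this example, not the authors' method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def score_frames(frame_features):
    """Score each frame by the mean attention it receives from all frames.

    frame_features: list of equal-length feature vectors, one per frame.
    This scoring rule is an illustrative assumption, not the paper's.
    """
    n = len(frame_features)
    scale = math.sqrt(len(frame_features[0]))
    # Scaled dot-product similarity between every pair of frames.
    logits = [
        [sum(a * b for a, b in zip(q, k)) / scale for k in frame_features]
        for q in frame_features
    ]
    # Each row is one frame's attention distribution over all frames.
    attention = [softmax(row) for row in logits]
    # Column mean: how much attention frame j receives on average.
    return [sum(attention[i][j] for i in range(n)) / n for j in range(n)]


def select_key_frames(frame_features, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    scores = score_frames(frame_features)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


if __name__ == "__main__":
    # Four toy frames with 2-dimensional features (hypothetical data).
    features = [[2.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 0.0]]
    print(select_key_frames(features, 3))
```

With this toy input, frames 0, 1, and 3 are selected because they attend to one another strongly. That also exposes a limitation of this simple rule: it favors frames that resemble many others, so a practical scorer would likely need a different or learned criterion.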

arXiv information

Authors	Abdelrahman Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan
Published	2025-03-27 17:59:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
