Clapper: Compact Learning and Video Representation in VLMs

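Summary

Current vision-language models (VLMs) have demonstrated remarkable capabilities across a wide range of video understanding applications.
Designing VLMs for video input requires modeling the temporal dimension effectively (that is, capturing dependencies across frames) and balancing the handling of short and long videos.
Specifically, short videos call for preserving fine-grained detail, whereas long videos call for strategic compression of visual information so that extended temporal context can be processed efficiently.
However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation on long video understanding tasks when visual tokens are compressed to less than a quarter of their original count.
To model both short and long video inputs more effectively, we propose Clapper, a method that uses a slow-fast strategy for video representation and introduces a new module, TimePerceiver, for efficient temporal-spatial encoding within existing VLM backbones.
With this method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy.
In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video.
The code will be made publicly available on the homepage.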

Abstract (original)

Current vision-language models (VLMs) have demonstrated remarkable capabilities across diverse video understanding applications. Designing VLMs for video inputs requires effectively modeling the temporal dimension (i.e., capturing dependencies across frames) and balancing the processing of short and long videos. Specifically, short videos demand preservation of fine-grained details, whereas long videos require strategic compression of visual information to handle extensive temporal contexts efficiently. However, our empirical analysis reveals a critical limitation: most existing VLMs suffer severe performance degradation in long video understanding tasks when compressing visual tokens below a quarter of their original count. To enable more effective modeling of both short and long video inputs, we propose Clapper, a method that utilizes a slow-fast strategy for video representation and introduces a novel module named TimePerceiver for efficient temporal-spatial encoding within existing VLM backbones. By using our method, we achieve 13x compression of visual tokens per frame (averaging 61 tokens/frame) without compromising QA accuracy. In our experiments, Clapper achieves 62.0% on VideoMME, 69.8% on MLVU, and 67.4% on TempCompass, all with fewer than 6,000 visual tokens per video. The code will be publicly available on the homepage.
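
The token figures quoted in the abstract can be related with some simple arithmetic. The sketch below is only an illustration under stated assumptions: it treats the 61 tokens/frame average as if it applied uniformly to every frame, and it infers the uncompressed per-frame count from the 13x ratio. The function names are hypothetical, and the abstract states neither the baseline token count nor the number of frames actually sampled.

```python
# Figures taken from the abstract.
COMPRESSED_TOKENS_PER_FRAME = 61   # reported average after compression
COMPRESSION_RATIO = 13             # reported per-frame compression factor
VIDEO_TOKEN_LIMIT = 6000           # "fewer than 6,000 visual tokens per video"


def implied_uncompressed_tokens_per_frame(
    compressed: int = COMPRESSED_TOKENS_PER_FRAME,
    ratio: int = COMPRESSION_RATIO,
) -> int:
    """Per-frame token count implied by the ratio (an inference, not a reported value)."""
    return compressed * ratio


def max_frames_under_limit(
    limit: int = VIDEO_TOKEN_LIMIT,
    tokens_per_frame: int = COMPRESSED_TOKENS_PER_FRAME,
) -> int:
    """Largest frame count whose total stays strictly below the limit.

    Assumes every frame costs exactly `tokens_per_frame`, which is a
    simplification because the abstract reports only an average.
    """
    return (limit - 1) // tokens_per_frame


if __name__ == "__main__":
    print(implied_uncompressed_tokens_per_frame())  # 793
    print(max_frames_under_limit())                 # 98
```

Under these assumptions, 13x compression corresponds to roughly 793 tokens per frame before compression, and a budget of fewer than 6,000 tokens leaves room for about 98 frames at the average rate. The same 13x figure is well past the one-quarter threshold at which the abstract says most existing VLMs degrade.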

arXiv information

Authors	Lingyu Kong, Hongzhi Zhang, Jingyuan Zhang, Jianzhao Huang, Kunze Li, Qi Wang, Fuzheng Zhang
Published	2025-05-21 13:52:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
