Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models

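Summary

Video large language models (VideoLLMs) excel at video understanding, but the quadratic complexity incurred by their many visual tokens creates efficiency challenges.
A systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) they overlook the distinctive visual signals of individual frames, which leads to information loss;
(ii) they suffer from implementation constraints that make them incompatible with modern architectures or efficient operators.
To address these challenges, the authors distill three design principles for VideoLLM token compression and propose "Video Compression Commander" (VidCom2), a plug-and-play inference acceleration framework.
By quantifying the uniqueness of each frame, VidCom2 adaptively adjusts compression intensity across frames, preserving essential information while reducing redundancy in the video sequence.
Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of VidCom2.
With only 25% of the visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing LLM generation latency by 70.8%.
Notably, the Frame Compression Adjustment strategy is compatible with other token compression methods and can further improve their performance.
The code is available at https://github.com/xuyang-liu16/VidCom2.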

Abstract (original)

Video large language models (VideoLLM) excel at video understanding, but face efficiency challenges due to the quadratic complexity of abundant visual tokens. Our systematic analysis of token compression methods for VideoLLMs reveals two critical issues: (i) overlooking distinctive visual signals across frames, leading to information loss; (ii) suffering from implementation constraints, causing incompatibility with modern architectures or efficient operators. To address these challenges, we distill three design principles for VideoLLM token compression and propose a plug-and-play inference acceleration framework ‘Video Compression Commander’ (VidCom2). By quantifying each frame’s uniqueness, VidCom2 adaptively adjusts compression intensity across frames, effectively preserving essential information while reducing redundancy in video sequences. Extensive experiments across various VideoLLMs and benchmarks demonstrate the superior performance and efficiency of our VidCom2. With only 25% visual tokens, VidCom2 achieves 99.6% of the original performance on LLaVA-OV while reducing 70.8% of the LLM generation latency. Notably, our Frame Compression Adjustment strategy is compatible with other token compression methods to further improve their performance. Our code is available at https://github.com/xuyang-liu16/VidCom2.
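The idea of giving more unique frames a larger share of a fixed token budget can be illustrated with a minimal sketch. This is not the VidCom2 implementation: the function names, the uniqueness score (one minus cosine similarity to the mean frame feature), the softmax weighting, and the `temperature` parameter are all assumptions chosen for illustration, and each frame is assumed to be summarized by a single feature vector.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def frame_uniqueness(frame_feats):
    """Hypothetical score: 1 - cosine similarity to the mean frame feature."""
    n = len(frame_feats)
    dim = len(frame_feats[0])
    mean = [sum(f[d] for f in frame_feats) / n for d in range(dim)]
    return [1.0 - cosine(f, mean) for f in frame_feats]


def allocate_budgets(uniqueness, tokens_per_frame, keep_ratio=0.25, temperature=1.0):
    """Split a global token budget across frames in proportion to softmax(uniqueness).

    The total kept is round(keep_ratio * tokens_per_frame * num_frames);
    no frame keeps more than tokens_per_frame tokens.
    """
    n = len(uniqueness)
    total = round(keep_ratio * tokens_per_frame * n)

    # Numerically stable softmax over the uniqueness scores.
    top = max(uniqueness)
    exps = [math.exp((u - top) / temperature) for u in uniqueness]
    z = sum(exps)
    raw = [e / z * total for e in exps]

    # Floor each share, cap it at the frame size, then hand out the remainder
    # one token at a time, largest fractional part first.
    budgets = [min(tokens_per_frame, math.floor(r)) for r in raw]
    leftover = total - sum(budgets)
    order = sorted(range(n), key=lambda i: raw[i] - math.floor(raw[i]), reverse=True)
    while leftover > 0:
        progressed = False
        for i in order:
            if leftover == 0:
                break
            if budgets[i] < tokens_per_frame:
                budgets[i] += 1
                leftover -= 1
                progressed = True
        if not progressed:
            break
    return budgets


# Three near-identical frames and one distinctive frame (toy 2-D features).
feats = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = frame_uniqueness(feats)
budgets = allocate_budgets(scores, tokens_per_frame=100, keep_ratio=0.25)
print(budgets)
```

In this toy input the fourth frame differs from the others, so it receives the largest share while the overall retention stays at 25% of the visual tokens. Which tokens to keep inside each frame is a separate decision that this sketch does not cover.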

arXiv information

Authors	Xuyang Liu, Yiyu Wang, Junpeng Ma, Linfeng Zhang
Published	2025-05-20 14:52:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
