Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

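Summary

Vision language models (VLMs) show strong capabilities in jointly processing visual and textual data.
However, redundant visual information often causes substantial computational overhead, particularly in long-form video scenarios.
Existing approaches mainly focus on one of two strategies: vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards the rest and so disrupts contextual continuity.
This work proposes KVTP (Keyframe-oriented Vision Token Pruning), a new framework that overcomes the drawbacks of both token pruning and keyframe selection.
By adaptively assigning pruning rates according to each frame's relevance to the query, KVTP retains essential contextual information while greatly reducing redundant computation.
To thoroughly evaluate the long-form video understanding capabilities of VLMs, the authors curated and reorganized subsets of VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA.
Experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatio-temporal and contextual consistency.
These results demonstrate the approach's effectiveness for efficient long-video processing and facilitate more scalable VLM deployment.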

Abstract (original)

Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video scenarios. Existing approaches predominantly focus on either vision token pruning, which may overlook spatio-temporal dependencies, or keyframe selection, which identifies informative frames but discards others, thus disrupting contextual continuity. In this work, we propose KVTP (Keyframe-oriented Vision Token Pruning), a novel framework that overcomes the drawbacks of token pruning and keyframe selection. By adaptively assigning pruning rates based on frame relevance to the query, KVTP effectively retains essential contextual information while significantly reducing redundant computation. To thoroughly evaluate the long-form video understanding capacities of VLMs, we curated and reorganized subsets from VideoMME, EgoSchema, and NextQA into a unified benchmark named SparseKV-QA that highlights real-world scenarios with sparse but crucial events. Our experiments with VLMs of various scales show that KVTP can reduce token usage by 80% without compromising spatiotemporal and contextual consistency, significantly cutting computation while maintaining the performance. These results demonstrate our approach’s effectiveness in efficient long-video processing, facilitating more scalable VLM deployment.
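The abstract states that pruning rates are assigned adaptively from each frame's relevance to the query, but it does not give the allocation rule. The sketch below is one possible reading, not the authors' method: it assumes per-frame relevance scores are already available, spreads a global keep budget across frames in proportion to a softmax over those scores, and then keeps the top-scoring tokens in each frame. The names `frame_keep_rates` and `prune_frame`, the softmax weighting, the `temperature` parameter, and the per-token scores are all hypothetical. A `keep_ratio` of 0.2 corresponds to the 80% token reduction reported above.

```python
import math


def frame_keep_rates(relevance, keep_ratio=0.2, temperature=1.0):
    """Map per-frame relevance scores to per-frame keep rates in [0, 1].

    Assumption (not from the paper): the global budget is split in
    proportion to softmax(relevance / temperature). Rates are capped at
    1.0 without redistributing the excess, so the mean rate can fall
    slightly below keep_ratio when one frame dominates.
    """
    if not relevance:
        return []
    top = max(relevance)  # subtract the max for numerical stability
    weights = [math.exp((r - top) / temperature) for r in relevance]
    total = sum(weights)
    n = len(relevance)
    return [min(1.0, keep_ratio * n * w / total) for w in weights]


def prune_frame(token_scores, keep_rate):
    """Return indices of the highest-scoring tokens of one frame.

    Indices are returned in their original order so that the spatial
    layout of the surviving tokens is preserved.
    """
    k = int(keep_rate * len(token_scores))
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:k])
```

With this allocation, every frame keeps at least a few tokens, so low-relevance frames still contribute context, while frames closest to the query keep most of theirs. A stricter scheme could redistribute the budget freed by the 1.0 cap; that refinement is omitted here for brevity.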

arXiv information

Authors	Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen
Published	2025-04-24 14:53:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
