VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models

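Summary

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding.
However, these methods incur significant computational costs because the visual token sequences become longer, which poses challenges for real-time deployment.
To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model.
This work revisits those design choices and reassesses their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages.
Guided by these insights, the authors propose VScan, a two-stage visual token reduction framework that addresses token redundancy by (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model.
Extensive experiments across four LVLMs validate the effectiveness of VScan in accelerating inference and show that it outperforms the current state of the art on sixteen benchmarks.
Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.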

Abstract (original)

Recent Large Vision-Language Models (LVLMs) have advanced multi-modal understanding by incorporating finer-grained visual perception and encoding. However, such methods incur significant computational costs due to longer visual token sequences, posing challenges for real-time deployment. To mitigate this, prior studies have explored pruning unimportant visual tokens either at the output layer of the visual encoder or at the early layers of the language model. In this work, we revisit these design choices and reassess their effectiveness through comprehensive empirical studies of how visual tokens are processed throughout the visual encoding and language decoding stages. Guided by these insights, we propose VScan, a two-stage visual token reduction framework that addresses token redundancy by: (1) integrating complementary global and local scans with token merging during visual encoding, and (2) introducing pruning at intermediate layers of the language model. Extensive experimental results across four LVLMs validate the effectiveness of VScan in accelerating inference and demonstrate its superior performance over current state-of-the-arts on sixteen benchmarks. Notably, when applied to LLaVA-NeXT-7B, VScan achieves a 2.91$\times$ speedup in prefilling and a 10$\times$ reduction in FLOPs, while retaining 95.4% of the original performance.
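
As a rough illustration of the generic idea of pruning unimportant visual tokens mentioned in the abstract, the sketch below keeps the highest-scoring fraction of tokens. It is not the authors' VScan implementation, whose details are not given on this page. The function name `select_tokens_to_keep`, the `keep_ratio` parameter, and the assumption that each token already has a scalar importance score (for example, derived from attention weights) are all hypothetical.

```python
from typing import List, Sequence


def select_tokens_to_keep(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the tokens to keep, in their original order.

    Hypothetical helper: `scores` stands in for any per-token importance
    measure; how such scores are computed is outside this sketch.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not scores:
        return []
    # Keep at least one token so the sequence never becomes empty.
    num_keep = max(1, round(len(scores) * keep_ratio))
    # Rank indices by descending score; ties favor the earlier token.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Restore positional order so the kept tokens preserve their layout.
    return sorted(ranked[:num_keep])


if __name__ == "__main__":
    example_scores = [0.1, 0.9, 0.3, 0.7]
    print(select_tokens_to_keep(example_scores, keep_ratio=0.5))
```

Returning indices rather than the tokens themselves lets the same selection be applied to any parallel structure, such as token embeddings or position ids.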

arXiv information

Authors	Ce Zhang,Kaixin Ma,Tianqing Fang,Wenhao Yu,Hongming Zhang,Zhisong Zhang,Yaqi Xie,Katia Sycara,Haitao Mi,Dong Yu
Published	2025-05-28 17:59:08+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
