SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Summary

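In vision-language models (VLMs), visual tokens usually account for a considerable share of the computational overhead, even though they carry sparser information than text tokens.
To address this, most existing methods learn a network that prunes redundant visual tokens using specific training data.
In contrast, we propose SparseVLM, a text-guided, training-free token optimization mechanism that eliminates the need for extra parameters or fine-tuning costs.
Given that visual tokens complement text tokens in a VLM's linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using the self-attention matrix, and then prune visual tokens with the proposed strategy to maximize sparsity while retaining information.
In particular, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, along with a token recycling method that compresses pruned tokens into more compact representations.
Experimental results show that SparseVLM improves the efficiency of various VLMs on a number of image and video understanding tasks.
For example, LLaVA equipped with SparseVLM achieves a 54% reduction in FLOPs and a 37% decrease in CUDA latency while maintaining 97% of its original accuracy.
Our code is available at https://github.com/Gumpest/SparseVLMs.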
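
The text-guided scoring, pruning, and recycling steps described above can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the "at least the mean attention mass" rule for picking text raters, the fixed `keep_ratio` (standing in for the rank-based per-layer ratio), and the recycling of pruned tokens into a single averaged token are all simplifying assumptions made for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def select_text_raters(attn: Matrix, text_idx: Sequence[int],
                       visual_idx: Sequence[int]) -> List[int]:
    """Pick text tokens that attend strongly to the image.

    Hypothetical rule: keep text tokens whose total attention to the
    visual tokens is at least the mean over all text tokens.
    """
    mass = [sum(attn[t][v] for v in visual_idx) for t in text_idx]
    mean_mass = sum(mass) / len(mass)
    raters = [t for t, m in zip(text_idx, mass) if m >= mean_mass]
    # Fall back to all text tokens if rounding leaves the list empty.
    return raters or list(text_idx)


def score_visual_tokens(attn: Matrix, raters: Sequence[int],
                        visual_idx: Sequence[int]) -> List[float]:
    """Average attention the rater text tokens pay to each visual token."""
    return [sum(attn[t][v] for t in raters) / len(raters) for v in visual_idx]


def prune_and_recycle(features: Matrix, attn: Matrix,
                      text_idx: Sequence[int], visual_idx: Sequence[int],
                      keep_ratio: float) -> Tuple[List[int], Optional[List[float]]]:
    """Keep the top-scoring visual tokens; merge the rest into one token.

    keep_ratio is a fixed assumption here; the summary describes a
    rank-based strategy that chooses the ratio adaptively per layer.
    """
    raters = select_text_raters(attn, text_idx, visual_idx)
    scores = score_visual_tokens(attn, raters, visual_idx)
    n_keep = max(1, int(keep_ratio * len(visual_idx)))
    # Positions within visual_idx, highest score first (ties keep input order).
    order = sorted(range(len(visual_idx)), key=lambda i: scores[i], reverse=True)
    kept = [visual_idx[i] for i in sorted(order[:n_keep])]
    pruned = [visual_idx[i] for i in order[n_keep:]]
    recycled = None
    if pruned:
        dim = len(features[0])
        # Simplified recycling: plain mean of the pruned token features.
        recycled = [sum(features[p][d] for p in pruned) / len(pruned)
                    for d in range(dim)]
    return kept, recycled


# Toy example: tokens 0-3 are visual, tokens 4-5 are text.
visual_idx = [0, 1, 2, 3]
text_idx = [4, 5]
attn = [[0.0] * 6 for _ in range(4)] + [
    [0.4, 0.1, 0.3, 0.1, 0.05, 0.05],  # text token 4 looks mostly at the image
    [0.1, 0.1, 0.1, 0.1, 0.3, 0.3],    # text token 5 looks mostly at text
]
features = [[0.0, 0.0], [1.0, 0.0], [2.0, 2.0], [3.0, 2.0], [9.0, 9.0], [9.0, 9.0]]

kept, recycled = prune_and_recycle(features, attn, text_idx, visual_idx, keep_ratio=0.5)
print(kept, recycled)
```

In this toy run only text token 4 acts as a rater, so the visual tokens it attends to most are kept and the remaining ones are folded into a single recycled feature vector rather than discarded outright.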

Abstract (original)

In vision-language models (VLMs), visual tokens usually bear a significant amount of computational overhead despite sparsity of information in them when compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens using certain training data. Differently, we propose a text-guided training-free token optimization mechanism dubbed SparseVLM that eliminates the need of extra parameters or fine-tuning costs. Given that visual tokens complement text tokens in VLM’s linguistic reasoning, we select relevant text tokens to rate the significance of visual tokens using self-attention matrices and, then, prune visual tokens using the proposed strategy to maximize sparsity while retaining information. In particular, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM increases the efficiency of various VLMs in a number of image and video understanding tasks. For example, LLaVA when equipped with SparseVLM achieves 54% reduction in FLOPs, 37% decrease in CUDA latency while maintaining 97% of its original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.

arXiv information

Authors	Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
Published	2025-02-06 14:31:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
