SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Summary

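In vision-language models (VLMs), visual tokens typically account for a large share of the computational overhead, even though they carry sparser information than text tokens.
To address this, most existing methods train a network to prune redundant visual tokens, which requires additional training data.
In contrast, we propose SparseVLM, an efficient training-free token optimization mechanism that needs no extra parameters and no fine-tuning.
Concretely, because visual tokens complement text tokens in VLMs for linguistic reasoning, we select the text tokens that are relevant to the visual input and use them to rate the importance of each visual token within the self-attention matrix extracted from the VLM.
We then progressively prune the irrelevant visual tokens.
To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, together with a token recycling method that compresses the pruned tokens into more compact representations.
Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks.
In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61% to 67% at a compression ratio of 78% while maintaining 93% of the original accuracy.
Our code is available at https://github.com/Gumpest/SparseVLMs.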
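
The scoring and pruning steps in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say how "visual-relevant" text tokens are chosen or how many tokens are kept, so the above-mean threshold, the `keep` parameter, and all function names here are assumptions made for illustration. The input is a hypothetical text-to-visual slice of an attention matrix (rows are text tokens, columns are visual tokens).

```python
from typing import List, Sequence


def select_raters(text_to_visual: Sequence[Sequence[float]]) -> List[int]:
    """Pick text tokens that attend to the image more than the average text token.

    Assumption: "visual-relevant" is approximated by an above-mean threshold.
    """
    relevance = [sum(row) / len(row) for row in text_to_visual]
    threshold = sum(relevance) / len(relevance)
    return [i for i, r in enumerate(relevance) if r >= threshold]


def score_visual_tokens(
    text_to_visual: Sequence[Sequence[float]], raters: Sequence[int]
) -> List[float]:
    """Score each visual token by the mean attention it receives from the raters."""
    n_visual = len(text_to_visual[0])
    return [
        sum(text_to_visual[r][j] for r in raters) / len(raters)
        for j in range(n_visual)
    ]


def prune(scores: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring visual tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Toy attention slice: 3 text tokens (rows) x 4 visual tokens (columns).
    attn = [
        [0.5, 0.3, 0.1, 0.1],
        [0.1, 0.1, 0.1, 0.1],
        [0.4, 0.1, 0.4, 0.1],
    ]
    raters = select_raters(attn)
    scores = score_visual_tokens(attn, raters)
    print("raters:", raters)
    print("kept visual tokens:", prune(scores, keep=2))
```

In this toy input the second text token attends weakly to every visual token, so it is excluded as a rater and does not dilute the scores. In a real VLM this step would be repeated at several decoder layers, which is what makes the pruning progressive.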

Abstract (original)

In vision-language models (VLMs), visual tokens usually consume a significant amount of computational overhead, despite their sparser information density compared to text tokens. To address this, most existing methods learn a network to prune redundant visual tokens and require additional training data. Differently, we propose an efficient training-free token optimization mechanism dubbed SparseVLM without extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLMs. Then we progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that our SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces 61% to 67% FLOPs with a compression ratio of 78% while maintaining 93% of the accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
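
The rank-based ratio and the token recycling step mentioned in the abstract can be sketched in the same spirit. The abstract gives no formulas, so everything below is an assumption for illustration: the number of tokens to prune is taken as a scaled gap between the number of visual tokens and the rank of the attention slice (the `scale` parameter is hypothetical), and recycling is simplified to a single score-weighted average of the pruned embeddings rather than any particular clustering scheme.

```python
from typing import List, Sequence


def matrix_rank(matrix: Sequence[Sequence[float]], tol: float = 1e-9) -> int:
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [[float(x) for x in r] for r in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Choose the remaining row with the largest entry in this column.
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # No usable pivot: this column adds no new direction.
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            for c in range(col, n_cols):
                rows[r][c] -= factor * rows[rank][c]
        rank += 1
    return rank


def tokens_to_prune(attn: Sequence[Sequence[float]], scale: float = 1.0) -> int:
    """Assumed rule: prune more tokens when the attention slice is low-rank."""
    n_visual = len(attn[0])
    redundancy = n_visual - matrix_rank(attn)
    return max(0, int(scale * redundancy))


def recycle(
    pruned_embeddings: Sequence[Sequence[float]], pruned_scores: Sequence[float]
) -> List[float]:
    """Compress pruned tokens into one token by score-weighted averaging.

    Assumes at least one pruned token and a positive total score.
    """
    total = sum(pruned_scores)
    dim = len(pruned_embeddings[0])
    return [
        sum(s * e[d] for s, e in zip(pruned_scores, pruned_embeddings)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # The third row is the sum of the first two, so the slice has rank 2.
    attn = [
        [1.0, 0.0, 1.0, 0.0],
        [0.0, 1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0, 1.0],
    ]
    print("tokens to prune:", tokens_to_prune(attn))
    print("recycled token:", recycle([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0]))
```

The intuition behind the rank heuristic is that linearly dependent attention columns signal redundant visual tokens, so a layer with a low-rank slice can tolerate heavier pruning than one whose slice is close to full rank.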

arXiv information

Authors	Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, Shanghang Zhang
Published	2024-10-09 15:04:16+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
