PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Summary

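Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks.
However, their inference efficiency is constrained by the large number of visual tokens processed during decoding.
To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method consisting of Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning.
Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer.
Layers that attend more strongly to visual information preserve more vision tokens, while layers with lower vision attention are pruned aggressively.
Furthermore, PLPHP applies pruning at the attention-head level, enabling different heads within the same layer to independently retain critical context.
Experiments on multiple benchmarks show that PLPHP delivers 18% faster decoding and reduces the Key-Value Cache (KV Cache) size by over 50%, at the cost of a 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks.
These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs.
The source code will be made publicly available.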

Summary (original)

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of visual tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method including Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are aggressively pruned. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP delivers an 18% faster decoding speed and reduces the Key-Value Cache (KV Cache) size by over 50%, all at the cost of 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs. Our source code will be made publicly available.
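
The two levels described in the abstract can be illustrated with a small, self-contained Python sketch. This is not the authors' implementation: the abstract does not give the allocation rule, the thresholds, or the scoring function, so everything here is an assumption made for illustration. In particular, the function names (`layer_retention_rate`, `prune_head`, `prune_layer`), the three-tier thresholds, the retention rates, and the use of the current query token's attention weights as the importance score are all hypothetical.

```python
import math
from typing import List


def layer_retention_rate(vision_share: float,
                         low: float = 0.1, high: float = 0.3,
                         rate_low: float = 0.1, rate_mid: float = 0.5,
                         rate_high: float = 0.9) -> float:
    """Layer level: map a layer's attention share on vision tokens to a
    retention rate. Thresholds and rates are hypothetical placeholders."""
    if vision_share >= high:
        return rate_high   # layer attends strongly to vision: keep most tokens
    if vision_share <= low:
        return rate_low    # layer mostly ignores vision: prune aggressively
    return rate_mid


def prune_head(scores: List[float], rate: float) -> List[int]:
    """Head level: keep the top ceil(rate * n) vision tokens of one head,
    ranked by that head's attention scores. Returns sorted local indices."""
    k = max(1, math.ceil(rate * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def prune_layer(attn: List[List[float]], vision_positions: List[int]) -> List[List[int]]:
    """attn[h][t] is the attention weight from the current query token to
    context token t in head h (each row assumed to sum to 1).
    Returns, for every head, the context positions of the vision tokens it keeps."""
    # Assumed layer score: attention mass on vision tokens, averaged over heads.
    share = sum(sum(head[t] for t in vision_positions) for head in attn) / len(attn)
    rate = layer_retention_rate(share)
    kept = []
    for head in attn:
        local = prune_head([head[t] for t in vision_positions], rate)
        kept.append([vision_positions[i] for i in local])
    return kept


if __name__ == "__main__":
    # Toy layer: 2 heads, 4 vision tokens (positions 0-3) and 2 text tokens.
    attn = [
        [0.10, 0.05, 0.03, 0.02, 0.40, 0.40],
        [0.02, 0.03, 0.05, 0.10, 0.40, 0.40],
    ]
    print(prune_layer(attn, [0, 1, 2, 3]))  # [[0, 1], [2, 3]]
```

In the toy example both heads receive the same retention rate because they sit in the same layer, yet each keeps a different pair of vision tokens. That is the point of pruning per head: the per-head key-value cache shrinks while each head preserves the context it actually attends to.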

arXiv information

Authors	Yu Meng,Kaiyuan Li,Chenran Huang,Chen Gao,Xinlei Chen,Yong Li,Xiaoping Zhang
Published	2025-02-20 12:31:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
