Saliency-driven Dynamic Token Pruning for Large Language Models

Summary

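Despite the recent success of large language models (LLMs), long-sequence inference remains particularly challenging for them because the computational cost of the attention mechanism grows quadratically with sequence length.
Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens contribute equally.
Based on this observation, we propose a token pruning framework, Saliency-driven Dynamic Token Pruning (SDTP), which gradually and dynamically prunes redundant tokens according to the input context.
Specifically, a lightweight saliency-driven prediction module estimates the importance score of each token from its hidden state; the module is added to different layers of the LLM to prune redundant tokens hierarchically.
In addition, a ranking-based optimization strategy is proposed to minimize the ranking divergence between the saliency score and the predicted importance score.
Extensive experiments show that the framework generalizes to various models and datasets.
By hierarchically pruning 65% of the input tokens, the method reduces FLOPs by 33% to 47% and achieves a speedup of up to 1.75× during inference while maintaining comparable performance.
We further demonstrate that SDTP can be combined with KV cache compression methods for further compression.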
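
The hierarchical pruning idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the linear scorer, the function names (`importance_scores`, `prune_stage`, `hierarchical_prune`), and the per-stage keep ratio are all assumptions made for illustration, and the transformer layers that would run between pruning stages are omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def importance_scores(hidden: Sequence[Vector], weights: Vector) -> List[float]:
    """Score each token with a linear map over its hidden state (assumed form)."""
    return [sum(h_i * w_i for h_i, w_i in zip(h, weights)) for h in hidden]


def prune_stage(
    hidden: Sequence[Vector],
    positions: Sequence[int],
    weights: Vector,
    keep_ratio: float,
) -> Tuple[List[Vector], List[int]]:
    """Keep the highest-scoring fraction of tokens, preserving their order."""
    scores = importance_scores(hidden, weights)
    n_keep = max(1, round(len(hidden) * keep_ratio))
    # Rank token indices by score, take the top n_keep, then restore sequence order.
    ranked = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [hidden[i] for i in kept], [positions[i] for i in kept]


def hierarchical_prune(
    hidden: Sequence[Vector],
    stage_weights: Sequence[Vector],
    keep_ratio: float,
) -> Tuple[List[Vector], List[int]]:
    """Apply one pruning stage per scorer; returns surviving states and positions."""
    current = list(hidden)
    positions = list(range(len(current)))
    for weights in stage_weights:
        # A real model would run a group of transformer layers here before
        # scoring; this sketch skips them and scores the states directly.
        current, positions = prune_stage(current, positions, weights, keep_ratio)
    return current, positions
```

With 100 tokens and three stages that each keep 70%, 34 tokens survive (100 → 70 → 49 → 34), which is roughly the 65% overall pruning rate quoted above. The per-stage ratio is an illustrative choice, not a value taken from the paper.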

Abstract (original)

Despite the recent success of large language models (LLMs), LLMs are particularly challenging in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability theory of feature attribution in neural network models, we observe that not all tokens have the same contribution. Based on this observation, we propose a novel token pruning framework, namely Saliency-driven Dynamic Token Pruning (SDTP), to gradually and dynamically prune redundant tokens based on the input context. Specifically, a lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state, which is added to different layers of the LLM to hierarchically prune redundant tokens. Furthermore, a ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score. Extensive experiments have shown that our framework is generalizable to various models and datasets. By hierarchically pruning 65\% of the input tokens, our method greatly reduces 33\% $\sim$ 47\% FLOPs and achieves speedup up to 1.75$\times$ during inference, while maintaining comparable performance. We further demonstrate that SDTP can be combined with KV cache compression method for further compression.
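
The abstract does not give the form of the ranking-based objective. As a hedged sketch only, the following uses a pairwise logistic (RankNet-style) loss as a stand-in: for every pair of tokens whose saliency scores differ, it penalizes the predictor when the more salient token does not receive the higher predicted score. The function name `pairwise_ranking_loss` and the choice of loss are assumptions, and the saliency scores are taken as given.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_ranking_loss(predicted: Sequence[float], saliency: Sequence[float]) -> float:
    """Mean logistic loss over token pairs ordered by saliency (assumed objective)."""
    if len(predicted) != len(saliency):
        raise ValueError("predicted and saliency must have the same length")
    total = 0.0
    n_pairs = 0
    for i in range(len(saliency)):
        for j in range(len(saliency)):
            if saliency[i] > saliency[j]:
                # Token i should outrank token j, so a small or negative
                # margin between their predicted scores is penalized.
                margin = predicted[i] - predicted[j]
                total += softplus(-margin)
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0
```

The loss depends only on the relative order of the predicted scores with respect to the saliency ranking, not on matching saliency values exactly, which is the property a ranking-based objective is meant to capture.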

arXiv information

Authors	Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang
Published	2025-04-09 14:36:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
