E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity

Summary

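Traditional pruning methods are hard to apply to Large Language Models (LLMs) for generative AI, because they require a prohibitively expensive training process and large amounts of computation.
E-Sparse introduces, for the first time, the information entropy of hidden state features into the design of a pruning metric, in order to improve the accuracy of N:M sparsity on LLMs.
It uses information richness as a signal of channel importance and incorporates several novel techniques to put this into practice:
(1) It uses information entropy to strengthen the importance estimate given by parameter weights and input feature norms, which yields a new pruning metric, and it applies N:M sparsity without modifying the remaining weights.
(2) It designs a global naive shuffle and a local block shuffle to quickly optimize the information distribution and to adequately cope with the impact of N:M sparsity on LLM accuracy.
E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs.
Extensive experiments on the LLaMA family and OPT models show that, compared with the dense model, E-Sparse significantly speeds up inference (up to 1.53x) and saves a significant amount of memory (up to 43.52%), with acceptable accuracy loss.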
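
To make the pruning metric in point (1) more concrete, here is a minimal NumPy sketch. The abstract does not give the exact formula, so everything specific below is an assumption made for illustration: the histogram-based entropy estimate, the way entropy and the input norm are combined (`entropy + alpha * norm`), the `alpha` and `bins` parameters, and all function names. It is not the authors' implementation.

```python
import numpy as np


def channel_entropy(x, bins=32):
    """Histogram estimate of the Shannon entropy of each input channel.

    x has shape (tokens, in_features). The estimator and `bins` are assumptions.
    """
    entropy = np.empty(x.shape[1])
    for j in range(x.shape[1]):
        counts, _ = np.histogram(x[:, j], bins=bins)
        p = counts[counts > 0] / counts.sum()
        entropy[j] = -(p * np.log(p)).sum()
    return entropy


def n_m_mask(score, n=2, m=4):
    """Boolean mask keeping the n highest scores in every group of m
    consecutive input channels (columns), independently for each output row."""
    out_features, in_features = score.shape
    assert in_features % m == 0, "in_features must be divisible by m"
    groups = score.reshape(out_features, in_features // m, m)
    # argsort is ascending, so the last n indices are the n largest scores.
    keep = np.argsort(groups, axis=-1)[..., m - n:]
    mask = np.zeros(groups.shape, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=-1)
    return mask.reshape(out_features, in_features)


def prune_n_m(weight, x, n=2, m=4, alpha=1.0):
    """Apply N:M sparsity to `weight` (out_features, in_features) using
    calibration activations `x` (tokens, in_features).

    Assumed score: |W_ij| * (entropy_j + alpha * ||x_j||_2).
    """
    entropy = channel_entropy(x)
    norm = np.linalg.norm(x, axis=0)
    score = np.abs(weight) * (entropy + alpha * norm)
    mask = n_m_mask(score, n, m)
    # Surviving weights are copied unchanged; pruned ones become zero.
    return weight * mask, mask


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
x = rng.normal(size=(64, 16))
w_sparse, mask = prune_n_m(w, x, n=2, m=4)
```

With `n=2, m=4` this produces the 2:4 pattern that sparse tensor cores on NVIDIA Ampere GPUs accelerate. The kept weights are left untouched, matching the statement that the remaining weights are not modified.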

Summary (original)

Traditional pruning methods are known to be challenging to work in Large Language Models (LLMs) for Generative AI because of their unaffordable training process and large computational demands. For the first time, we introduce the information entropy of hidden state features into a pruning metric design, namely E-Sparse, to improve the accuracy of N:M sparsity on LLM. E-Sparse employs the information richness to leverage the channel importance, and further incorporates several novel techniques to put it into effect: (1) it introduces information entropy to enhance the significance of parameter weights and input feature norms as a novel pruning metric, and performs N:M sparsity without modifying the remaining weights. (2) it designs global naive shuffle and local block shuffle to quickly optimize the information distribution and adequately cope with the impact of N:M sparsity on LLMs’ accuracy. E-Sparse is implemented as a Sparse-GEMM on FasterTransformer and runs on NVIDIA Ampere GPUs. Extensive experiments on the LLaMA family and OPT models show that E-Sparse can significantly speed up the model inference over the dense model (up to 1.53X) and obtain significant memory saving (up to 43.52%), with acceptable accuracy loss.
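
The abstract names a global naive shuffle and a local block shuffle but does not describe them. The sketch below only illustrates the general idea of a channel shuffle before N:M pruning: reorder the input channels so that important channels do not all land in the same group of M, and apply the same permutation to the weights and the activations so the layer output is unchanged. The round-robin dealing strategy, the `importance` vector, and the function name are assumptions for illustration, not the paper's algorithm.

```python
import numpy as np


def spread_permutation(importance, m=4):
    """Return a channel permutation that deals channels, sorted by
    importance, round-robin into groups of m consecutive positions.

    Hypothetical helper: each group ends up with one channel from each
    importance tier instead of several top channels competing in one group.
    """
    channels = importance.shape[0]
    assert channels % m == 0, "channel count must be divisible by m"
    num_groups = channels // m
    order = np.argsort(-importance)  # most important channel first
    # Row r of the (m, num_groups) view holds importance tier r;
    # transposing makes row k the m channels assigned to group k.
    return order.reshape(m, num_groups).T.reshape(-1)


importance = np.array([8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0])
perm = spread_permutation(importance, m=4)

rng = np.random.default_rng(0)
w = rng.normal(size=(3, 8))   # (out_features, in_features)
x = rng.normal(size=(5, 8))   # (tokens, in_features)

# Permuting weight columns and activation channels together keeps x @ w.T intact.
y_ref = x @ w.T
y_perm = x[:, perm] @ w[:, perm].T
```

In this toy case the unshuffled groups have importance sums of 26 and 10, while the shuffled groups have 20 and 16, so no single group has to discard two of the strongest channels.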

arXiv information

Authors	Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang
Published	2024-03-22 09:18:24+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
