ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models

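Summary

The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated efficiency techniques such as quantization and sparsity.
Contextual sparsity, in which the sparsity pattern depends on the input, is crucial for LLMs because permanently removing attention heads or neurons can significantly degrade accuracy.
Prior work has tried to model contextual sparsity with neural networks trained to predict activation magnitudes, so that structures with low predicted activation magnitude can be pruned dynamically.
In this paper, we look beyond magnitude-based pruning criteria to assess the importance of attention heads and neurons in LLMs.
We develop a new predictor, ShadowLLM, which shadows the LLM's behavior and enforces better sparsity patterns, improving end-to-end accuracy by more than 15% compared with prior methods.
ShadowLLM also achieves up to a 20% speed-up over the state-of-the-art DejaVu framework.
These improvements are validated on Llama-2 and OPT models with up to 30 billion parameters.
The code is available at https://github.com/abdelfattah-lab/shadow_llm/.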

Summary (original)

The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated efficiency techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We develop a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy compared to prior methods. In addition, ShadowLLM achieves up to a 20% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on Llama-2 and OPT models with up to 30 billion parameters. Our code is available at https://github.com/abdelfattah-lab/shadow_llm/.
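
To make the idea of predictor-driven contextual sparsity concrete, here is a minimal sketch in plain Python. It is not taken from the ShadowLLM repository: the names `linear_predictor`, `top_k_mask`, `apply_head_mask`, and `keep_ratio` are hypothetical, and the one-layer linear predictor is only a stand-in for whatever predictor and importance criterion the paper actually uses. The sketch shows only the general pattern the abstract describes: a cheap predictor scores each attention head for the current input, and heads with low predicted scores are pruned for that input alone.

```python
from typing import List, Sequence


def linear_predictor(x: Sequence[float], weights: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical stand-in predictor: one dot product per head gives its score."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def top_k_mask(scores: Sequence[float], keep_ratio: float) -> List[bool]:
    """Return a keep-mask that retains the highest-scoring fraction of units."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, int(keep_ratio * len(scores)))
    # Rank indices by descending predicted score; ties go to the lower index.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(scores))]


def apply_head_mask(head_outputs: Sequence[Sequence[float]], mask: Sequence[bool]) -> List[List[float]]:
    """Zero the outputs of pruned heads (a real kernel would skip computing them)."""
    return [
        list(out) if keep else [0.0] * len(out)
        for out, keep in zip(head_outputs, mask)
    ]


# Illustrative predictor weights for four heads and a 2-dimensional input.
W = [[0.5, 0.0], [0.0, 1.0], [1.0, 0.25], [-1.0, -1.0]]

# Two different inputs select two different sets of heads.
mask_a = top_k_mask(linear_predictor([1.0, -2.0], W), keep_ratio=0.5)
mask_b = top_k_mask(linear_predictor([1.0, 2.0], W), keep_ratio=0.5)
```

With these illustrative weights, `mask_a` keeps heads 0 and 3 while `mask_b` keeps heads 1 and 2, which is the input dependence that separates contextual sparsity from permanent (static) pruning.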

arXiv information

Authors	Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah
Published	2024-10-17 15:45:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
