An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs

Summary

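In recent years, Transformer-based language models have become the standard approach for natural language processing tasks.
However, the stringent throughput and latency requirements of industrial applications limit their adoption.
To narrow this gap, model compression techniques such as structured pruning are used to improve inference efficiency.
However, most existing neural network inference runtimes do not adequately support structured sparsity.
This paper proposes an efficient sparse deep learning inference software stack for Transformer-based language models whose weights are pruned with a constant block size.
The sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix-dense matrix multiplication (commonly abbreviated as SpMM) on CPUs.
The SpMM kernel outperforms existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude across a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%).
Moreover, the SpMM kernel shows up to 5x speedup over the dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry.
The sparse accelerator is applied to widely used Transformer-based language models, including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large.
The sparse inference software shows up to 1.5x speedup over Neural Magic's Deepsparse under the same configuration on Xeon on Amazon Web Services, under proxy production latency constraints.
The authors also compare their solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints.
All source code is publicly available on GitHub: https://github.com/intel/intel-extension-for-transformers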

Summary (original)

In recent years, Transformer-based language models have become the standard approach for natural language processing tasks. However, stringent throughput and latency requirements in industrial applications are limiting their adoption. To mitigate the gap, model compression techniques such as structured pruning are being used to improve inference efficiency. However, most existing neural network inference runtimes lack adequate support for structured sparsity. In this paper, we propose an efficient sparse deep learning inference software stack for Transformer-based language models where the weights are pruned with constant block size. Our sparse software accelerator leverages Intel Deep Learning Boost to maximize the performance of sparse matrix – dense matrix multiplication (commonly abbreviated as SpMM) on CPUs. Our SpMM kernel outperforms the existing sparse libraries (oneMKL, TVM, and LIBXSMM) by an order of magnitude on a wide range of GEMM shapes under 5 representative sparsity ratios (70%, 75%, 80%, 85%, 90%). Moreover, our SpMM kernel shows up to 5x speedup over dense GEMM kernel of oneDNN, a well-optimized dense library widely used in industry. We apply our sparse accelerator on widely-used Transformer-based language models including Bert-Mini, DistilBERT, Bert-Base, and BERT-Large. Our sparse inference software shows up to 1.5x speedup over Neural Magic’s Deepsparse under same configurations on Xeon on Amazon Web Services under proxy production latency constraints. We also compare our solution with two framework-based inference solutions, ONNX Runtime and PyTorch, and demonstrate up to 37x speedup over ONNX Runtime and 345x over PyTorch on Xeon under the latency constraints. All the source code is publicly available on Github: https://github.com/intel/intel-extension-for-transformers.
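
The core idea in the abstract (weights pruned in blocks of constant size, then multiplied with a dense matrix while skipping the pruned blocks) can be illustrated with a minimal pure-Python sketch. This is not the authors' kernel and does not use Intel Deep Learning Boost instructions. The 4x1 block shape, the L1-magnitude pruning criterion, all function names, and the small integer test matrices are assumptions chosen only to keep the example readable and exactly checkable.

```python
import random


def prune_blocks(weight, block_rows, block_cols, sparsity):
    """Zero the lowest-magnitude blocks so that `sparsity` of all blocks are pruned.

    Assumption: blocks are ranked by the sum of absolute values (L1 magnitude).
    """
    rows, cols = len(weight), len(weight[0])
    assert rows % block_rows == 0 and cols % block_cols == 0
    origins = [(r0, c0)
               for r0 in range(0, rows, block_rows)
               for c0 in range(0, cols, block_cols)]

    def magnitude(origin):
        r0, c0 = origin
        return sum(abs(weight[r][c])
                   for r in range(r0, r0 + block_rows)
                   for c in range(c0, c0 + block_cols))

    n_prune = int(len(origins) * sparsity)
    pruned = [row[:] for row in weight]
    for r0, c0 in sorted(origins, key=magnitude)[:n_prune]:
        for r in range(r0, r0 + block_rows):
            for c in range(c0, c0 + block_cols):
                pruned[r][c] = 0
    return pruned


def to_block_sparse(weight, block_rows, block_cols):
    """Store only the nonzero blocks as (row_offset, col_offset, values)."""
    blocks = []
    for r0 in range(0, len(weight), block_rows):
        for c0 in range(0, len(weight[0]), block_cols):
            values = [weight[r][c0:c0 + block_cols]
                      for r in range(r0, r0 + block_rows)]
            if any(v != 0 for row in values for v in row):
                blocks.append((r0, c0, values))
    return blocks


def spmm(blocks, n_rows, dense):
    """Multiply a block-sparse weight matrix by a dense matrix, visiting stored blocks only."""
    n_cols = len(dense[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r0, c0, values in blocks:
        for i, block_row in enumerate(values):
            for j, w in enumerate(block_row):
                for n in range(n_cols):
                    out[r0 + i][n] += w * dense[c0 + j][n]
    return out


def dense_matmul(a, b):
    """Reference dense multiplication used to check the sparse result."""
    return [[sum(a[i][k] * b[k][n] for k in range(len(b)))
             for n in range(len(b[0]))]
            for i in range(len(a))]


# Hypothetical 8x8 integer weight matrix and 8x3 activation matrix.
rng = random.Random(0)
weight = [[rng.randint(-9, 9) for _ in range(8)] for _ in range(8)]
activations = [[rng.randint(-9, 9) for _ in range(3)] for _ in range(8)]

pruned = prune_blocks(weight, 4, 1, 0.75)   # 75% of the 4x1 blocks set to zero
blocks = to_block_sparse(pruned, 4, 1)      # at most 4 of 16 blocks remain
sparse_result = spmm(blocks, len(pruned), activations)
dense_result = dense_matmul(pruned, activations)
assert sparse_result == dense_result
```

With 75% of the blocks pruned, the sparse loop touches at most a quarter of the weight entries that the dense reference visits. That is where the potential saving comes from, although the speedups reported in the abstract depend on the authors' optimized kernels rather than on anything in this sketch.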

arXiv information

Authors	Haihao Shen, Hengyu Meng, Bo Dong, Zhe Wang, Ofir Zafrir, Yi Ding, Yu Luo, Hanwen Chang, Qun Gao, Ziheng Wang, Guy Boudoukh, Moshe Wasserblat
Published	2023-06-28 23:55:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
