Towards End-to-end 4-Bit Inference on Generative Large Language Models

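Summary

We show that most of the inference computation for large generative models such as LLaMA and OPT can be performed with both weights and activations cast to 4 bits, yielding practical speedups while maintaining good accuracy.
This is achieved through a hybrid quantization strategy called QUIK, which compresses most weights and activations to 4 bits while keeping a small set of outlier weights and activations in higher precision.
Crucially, the scheme is designed with computational efficiency in mind: the authors provide GPU kernels with highly efficient layer-wise runtimes, which deliver practical end-to-end throughput improvements of up to 3.1x over FP16 execution.
Code and models are available at https://github.com/IST-DASLab/QUIK.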
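
The hybrid idea in the summary (4-bit arithmetic for most values, higher precision for a few outliers) can be illustrated with a small sketch. This is not the QUIK implementation or its GPU kernels: the function names `quantize_int4` and `hybrid_linear`, the symmetric per-row scaling, and the hand-picked outlier index are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
# Illustrative sketch only; NOT code from the QUIK repository.
# Assumptions: symmetric INT4 quantization with one scale per vector,
# and outlier input features supplied by the caller.

def quantize_int4(values):
    """Symmetric 4-bit quantization: returns (ints in [-8, 7], scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale


def hybrid_linear(x, weight, outlier_cols):
    """Approximate y = weight @ x for one token.

    Input features listed in outlier_cols stay in full precision;
    all other features go through a 4-bit integer dot product.
    """
    base_cols = [j for j in range(len(x)) if j not in outlier_cols]
    xq, x_scale = quantize_int4([x[j] for j in base_cols])
    y = []
    for row in weight:
        wq, w_scale = quantize_int4([row[j] for j in base_cols])
        # Integer accumulation on the 4-bit part, then dequantize once.
        acc = sum(a * b for a, b in zip(wq, xq))
        low = acc * w_scale * x_scale
        # Full-precision dot product on the outlier features.
        high = sum(row[j] * x[j] for j in outlier_cols)
        y.append(low + high)
    return y


# One activation (index 2) is far larger than the rest.
x = [0.1, -0.2, 50.0, 0.3]
weight = [[0.5, -0.25, 0.1, 0.75],
          [-1.0, 0.5, 0.2, 0.25]]

y_hybrid = hybrid_linear(x, weight, outlier_cols={2})
y_plain = hybrid_linear(x, weight, outlier_cols=set())  # everything in 4 bits
```

In this toy input, quantizing everything to 4 bits lets the single large activation set the scale, so the small activations round to zero; keeping that one feature in full precision leaves `y_hybrid` much closer to the exact product than `y_plain`.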

Abstract (original)

We show that the majority of the inference computations for large generative models such as LLaMA and OPT can be performed with both weights and activations being cast to 4 bits, in a way that leads to practical speedups while at the same time maintaining good accuracy. We achieve this via a hybrid quantization strategy called QUIK, which compresses most of the weights and activations to 4-bit, while keeping some outlier weights and activations in higher-precision. Crucially, our scheme is designed with computational efficiency in mind: we provide GPU kernels with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput improvements of up to 3.1x relative to FP16 execution. Code and models are provided at https://github.com/IST-DASLab/QUIK.

arXiv information

Authors	Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
Published	2023-10-13 17:15:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
