I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models

Summary

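Post-training quantization (PTQ) is a powerful technique for accelerating inference in large language models (LLMs).
Nevertheless, existing methods still require a considerable number of floating-point (FP) operations during inference, including additional quantization and dequantization steps and non-linear operators such as RMSNorm and Softmax.
This limitation hinders the deployment of LLMs on edge and cloud devices.
In this paper, we identify the main obstacle to integer-only quantization of LLMs as the large fluctuation of activations across channels and tokens, in both linear and non-linear operations.
To address this issue, we propose I-LLM, a novel integer-only, fully-quantized PTQ framework tailored for LLMs.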
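Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights.
(2) To alleviate the degradation caused by inter-token variations, we introduce a novel approach called Dynamic Integer-only MatMul (DI-MatMul).
This method enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the inputs and outputs with integer-only operations.
(3) We design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which use bit shifts to execute non-linear operators efficiently while maintaining accuracy.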
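Experiments show that I-LLM achieves accuracy comparable to the FP baseline and outperforms non-integer quantization methods.
For example, I-LLM can operate at W4A4 (4-bit weights and 4-bit activations) with negligible loss of accuracy.
To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs.
We have published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.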
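
The idea in item (2), dynamically quantizing the output of an integer matrix multiplication without leaving integer arithmetic, can be illustrated with a small sketch. This is not the paper's DI-MatMul algorithm. It assumes, for illustration only, that each scale is stored as a dyadic pair (m, k) meaning m / 2**k, and the helper names `scaled_div`, `to_dyadic`, and `dynamic_int_matmul` are hypothetical.

```python
def scaled_div(num, den, k):
    """Return round(num * 2**k / den) using integers only (k may be negative)."""
    if k >= 0:
        return ((num << k) + den // 2) // den
    den <<= -k
    return (num + den // 2) // den


def to_dyadic(num, den, mbits=8):
    """Approximate the positive ratio num/den as m / 2**k with 0 < m < 2**mbits."""
    limit = 1 << mbits
    k = 0
    while scaled_div(num, den, k + 1) < limit:  # add precision while m still fits
        k += 1
    while scaled_div(num, den, k) >= limit:     # ratio too large: use a negative k
        k -= 1
    return scaled_div(num, den, k), k


def dynamic_int_matmul(x_q, x_scale, w_q, w_scale, bits=8):
    """Multiply one quantized token by a quantized weight matrix and requantize
    the result with a scale derived from this token's own output range.

    x_q: list of ints (one token); w_q: list of rows, shape [in][out].
    x_scale, w_scale: assumed dyadic scales (m, k), i.e. real = int * m / 2**k.
    """
    (mx, kx), (mw, kw) = x_scale, w_scale
    n_out = len(w_q[0])
    # Integer accumulation, as in an ordinary quantized matmul.
    acc = [sum(x * row[j] for x, row in zip(x_q, w_q)) for j in range(n_out)]
    max_abs = max(abs(a) for a in acc)
    if max_abs == 0:
        return [0] * n_out, (0, 0)
    qmax = (1 << (bits - 1)) - 1
    # Map the accumulator onto [-qmax, qmax] with round-to-nearest integer division.
    y_q = []
    for a in acc:
        mag = (abs(a) * qmax + max_abs // 2) // max_abs
        y_q.append(mag if a >= 0 else -mag)
    # Output scale = s_x * s_w * max_abs / qmax, again kept as a dyadic pair.
    m, k = to_dyadic(mx * mw * max_abs, qmax)
    return y_q, (m, k + kx + kw)


if __name__ == "__main__":
    w = [[3, -1, 2], [4, 0, -5], [-2, 7, 1], [6, -3, -4]]
    print(dynamic_int_matmul([10, -20, 30, 5], (129, 14), w, (205, 12)))
```

Because the output scale is recomputed for every token from that token's largest accumulator value, a token with unusually large activations does not force a coarse quantization step onto the others. The sketch uses integer division for clarity; a hardware-oriented version would likely replace it with shifts.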

Summary (original)

Post-training quantization (PTQ) serves as a potent technique to accelerate the inference of large language models (LLMs). Nonetheless, existing works still necessitate a considerable number of floating-point (FP) operations during inference, including additional quantization and de-quantization, as well as non-linear operators such as RMSNorm and Softmax. This limitation hinders the deployment of LLMs on the edge and cloud devices. In this paper, we identify the primary obstacle to integer-only quantization for LLMs lies in the large fluctuation of activations across channels and tokens in both linear and non-linear operations. To address this issue, we propose I-LLM, a novel integer-only fully-quantized PTQ framework tailored for LLMs. Specifically, (1) we develop Fully-Smooth Block-Reconstruction (FSBR) to aggressively smooth inter-channel variations of all activations and weights. (2) to alleviate degradation caused by inter-token variations, we introduce a novel approach called Dynamic Integer-only MatMul (DI-MatMul). This method enables dynamic quantization in full-integer matrix multiplication by dynamically quantizing the input and outputs with integer-only operations. (3) we design DI-ClippedSoftmax, DI-Exp, and DI-Normalization, which utilize bit shift to execute non-linear operators efficiently while maintaining accuracy. The experiment shows that our I-LLM achieves comparable accuracy to the FP baseline and outperforms non-integer quantization methods. For example, I-LLM can operate at W4A4 with negligible loss of accuracy. To our knowledge, we are the first to bridge the gap between integer-only quantization and LLMs. We’ve published our code on anonymous.4open.science, aiming to contribute to the advancement of this field.
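
To make the "bit shift" idea behind the non-linear operators more concrete, here is a minimal sketch of a shift-based exponential of the kind used in integer-only Transformer inference. It is not the paper's DI-Exp. It assumes a fixed-point input with a power-of-two scale (`frac_bits` fractional bits) and a non-positive argument, as in a softmax after subtracting the row maximum, and the function name `shift_exp` is hypothetical.

```python
import math


def shift_exp(x_q, frac_bits=10):
    """Approximate exp(x) for x <= 0, where x = x_q / 2**frac_bits.

    Returns an integer y_q such that exp(x) is roughly y_q / 2**frac_bits.
    Only additions, subtractions, masks, and shifts are used.
    """
    if x_q > 0:
        raise ValueError("expects a non-positive input, e.g. x - max(x)")
    # exp(x) = 2**(x * log2(e)), and log2(e) ~= 1 + 1/2 - 1/16 = 1.4375.
    t = x_q + (x_q >> 1) - (x_q >> 4)
    n = -t                                # magnitude of the base-2 exponent
    q = n >> frac_bits                    # integer part
    r = n & ((1 << frac_bits) - 1)        # fractional part, in [0, 2**frac_bits)
    # Linear approximation 2**(-p) ~= 1 - p/2 for p in [0, 1).
    mant = (1 << frac_bits) - (r >> 1)
    return mant >> q                      # apply 2**(-q) as a right shift


if __name__ == "__main__":
    for x in (0.0, -0.5, -1.0, -4.0):
        print(x, shift_exp(int(x * 1024)) / 1024, math.exp(x))
```

The linear approximation of the fractional part is the main source of error (a few hundredths in absolute terms near x = 0), which is the usual trade-off for avoiding floating-point hardware.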

arXiv information

Authors	Xing Hu, Yuan Cheng, Dawei Yang, Zhihang Yuan, Jiangyong Yu, Chen Xu, Sifan Zhou
Published	2024-06-05 15:26:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
