Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization

Abstract

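Large language models (LLMs) excel at natural language processing tasks, but their deployment is often limited by their large parameter counts and computational demands.
This paper focuses on post-training quantization (PTQ) of LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, as a way to improve computational efficiency. This setting has been explored less than weight-only quantization.
We present two techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC). They enhance PTQ by considering the combined effects on weights and activations and by aligning calibration sequence lengths with the target tasks.
In addition, we introduce dINT, a hybrid data format that combines integer and denormal representations, to address the underflow issue in W4A8 quantization, in which small values are rounded to zero.
Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly raise task accuracy to levels comparable with full-precision models.
By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ improvement in hardware efficiency compared with an 8-bit integer MAC unit.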
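
To make the underflow issue concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization. It is a generic textbook scheme, not the paper's AQAS, SLAC, or dINT, and the function names and sample values are hypothetical, chosen only for illustration. It shows how small-magnitude values collapse to zero at 4 bits while surviving at 8 bits.

```python
def quantize_symmetric(values, bits):
    """Generic symmetric uniform quantization (illustrative, not the paper's method)."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def underflow_fraction(values, codes):
    """Fraction of nonzero inputs whose integer code became zero."""
    nonzero = [(v, c) for v, c in zip(values, codes) if v != 0]
    lost = sum(1 for _, c in nonzero if c == 0)
    return lost / len(nonzero)


# Hypothetical sample: two large values and four small ones.
values = [0.70, -0.32, 0.04, -0.03, 0.02, 0.01]

codes4, scale4 = quantize_symmetric(values, bits=4)
codes8, scale8 = quantize_symmetric(values, bits=8)

print("4-bit codes:", codes4)   # [7, -3, 0, 0, 0, 0]
print("8-bit codes:", codes8)
print("4-bit underflow fraction:", underflow_fraction(values, codes4))
print("8-bit underflow fraction:", underflow_fraction(values, codes8))
```

With a 4-bit grid the step size is set by the largest value, so every input smaller than half a step is rounded to zero. This is the kind of information loss that the abstract says dINT is designed to address.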

Abstract (original)

Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency — a topic less explored compared to weight-only quantization. We present two innovative techniques: activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC) to enhance PTQ by considering the combined effects on weights and activations and aligning calibration sequence lengths to target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement compared to 8-bit integer MAC unit.

arXiv information

Authors	Jangwhan Lee,Minsoo Kim,Seungcheol Baek,Seok Joong Hwang,Wonyong Sung,Jungwook Choi
Published	2023-11-09 06:19:51+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
