VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

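Summary

Scaling model size poses a major challenge for the deployment and inference of large language models (LLMs).
Because LLM weights are redundant, recent research has focused on pushing weight-only quantization to extremely low bit widths (even down to 2 bits).
This reduces memory requirements, optimizes storage costs, and lowers the memory bandwidth needed during inference.
However, because of the limits of numerical representation, traditional scalar-based weight quantization struggles to reach such extremely low bit widths.
Recent research on vector quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables.
This paper introduces Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs.
The authors formulate the LLM VQ problem using second-order optimization and let the solution of that optimization guide the design of the quantization algorithm.
They further refine the weights using channel-independent second-order optimization for a granular VQ.
In addition, by decomposing the optimization problem, they propose a brief and effective codebook initialization algorithm.
They also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model.
The experimental results show that, compared with SOTA at 2-bit, VPTQ reduces quantized-model perplexity by $0.01$-$0.34$ on LLaMA-2, $0.38$-$0.68$ on Mistral-7B, and $4.41$-$7.34$ on LLaMA-3.
On QA tasks, average accuracy improves by $0.79$-$1.5\%$ on LLaMA-2, $1\%$ on Mistral-7B, and $11$-$22\%$ on LLaMA-3.
VPTQ uses only $10.4$-$18.6\%$ of the quantization algorithm execution time and achieves a $1.6$-$1.8\times$ increase in inference throughput compared with SOTA.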
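
To make the lookup-table idea in the summary concrete, here is a minimal sketch of plain vector quantization of a weight matrix. It is not the VPTQ algorithm: it fits the codebook with ordinary k-means under Euclidean distance and ignores the second-order information the paper relies on. The function names, the vector length `v = 4`, and the codebook size `k = 256` are assumptions chosen for illustration.

```python
import math
import numpy as np

def assign(vectors, codebook):
    """Return the index of the nearest codebook entry for every vector."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook with plain Lloyd's k-means (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Initialize from k distinct rows of the data.
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = assign(vectors, codebook)
        for j in range(k):
            members = vectors[idx == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                codebook[j] = members.mean(axis=0)
    return codebook

def quantize(weight, v, k):
    """Split `weight` into length-v vectors and store one index per vector."""
    vectors = weight.reshape(-1, v)
    codebook = kmeans_codebook(vectors, k)
    return assign(vectors, codebook), codebook

def dequantize(indices, codebook, shape):
    """Rebuild the weight matrix with a single table lookup."""
    return codebook[indices].reshape(shape)

# Hypothetical toy weight matrix; real LLM layers are far larger.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
v, k = 4, 256

indices, codebook = quantize(W, v, k)
W_hat = dequantize(indices, codebook, W.shape)

# Each index costs log2(k) bits and covers v weights (codebook storage excluded).
index_bits_per_weight = math.log2(k) / v
mse = float(((W - W_hat) ** 2).mean())
print(f"index bits per weight: {index_bits_per_weight}, reconstruction MSE: {mse:.4f}")
```

With these assumed settings each index takes 8 bits and stands for 4 weights, so the index stream costs 2 bits per weight. A scalar quantizer at the same budget would have only 4 representable values per weight, which is the limitation the summary attributes to scalar-based methods.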

Abstract (original)

Scaling model size significantly challenges the deployment and inference of Large Language Models (LLMs). Due to the redundancy in LLM weights, recent research has focused on pushing weight-only quantization to extremely low-bit (even down to 2 bits). It reduces memory requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extreme low-bit. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by $0.01$-$0.34$ on LLaMA-2, $0.38$-$0.68$ on Mistral-7B, $4.41$-$7.34$ on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of $0.79$-$1.5\%$ on LLaMA-2, $1\%$ on Mistral-7B, $11$-$22\%$ on LLaMA-3 on QA tasks on average. We only utilize $10.4$-$18.6\%$ of the quantization algorithm execution time, resulting in a $1.6$-$1.8\times$ increase in inference throughput compared to SOTA.
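
The residual quantization mentioned in the abstract can be illustrated with a generic multi-stage sketch: quantize the vectors, subtract the reconstruction, then quantize what is left with a second codebook. This is a textbook residual VQ using plain k-means, written under assumed names and sizes (`k = 16`, vector length 4); it does not reproduce VPTQ's second-order optimization or its outlier handling.

```python
import math
import numpy as np

def nearest(vectors, codebook):
    """Index of the closest codebook entry (squared Euclidean distance)."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_codebook(vectors, k, iters=15, seed=0):
    """Plain Lloyd's k-means; a stand-in for the paper's codebook construction."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = nearest(vectors, codebook)
        for j in range(k):
            members = vectors[idx == j]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

def residual_quantize(vectors, k, stages):
    """Quantize, then quantize the leftover error, `stages` times in total."""
    residual = vectors.copy()
    codebooks, indices = [], []
    for s in range(stages):
        cb = fit_codebook(residual, k, seed=s)
        idx = nearest(residual, cb)
        codebooks.append(cb)
        indices.append(idx)
        residual = residual - cb[idx]  # what this stage failed to capture
    return codebooks, indices

def reconstruct(codebooks, indices):
    """Sum the looked-up entries of every stage."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

# Hypothetical toy data: 1024 weight vectors of length 4.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((1024, 4)).astype(np.float32)
k, v = 16, 4

cbs1, idx1 = residual_quantize(vectors, k, stages=1)
cbs2, idx2 = residual_quantize(vectors, k, stages=2)
err_one = float(((vectors - reconstruct(cbs1, idx1)) ** 2).mean())
err_two = float(((vectors - reconstruct(cbs2, idx2)) ** 2).mean())

# Two stages of log2(16) = 4-bit indices over 4 weights: 2 index bits per weight.
bits_two_stage = 2 * math.log2(k) / v
print(f"1 stage MSE: {err_one:.4f}, 2 stage MSE: {err_two:.4f}, bits/weight: {bits_two_stage}")
```

In this sketch the second stage spends extra index bits to shrink the remaining error, which mirrors the accuracy benefit the abstract describes for residual quantization. How VPTQ balances that cost against model size is described in the paper itself, not here.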

arXiv information

Authors	Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang
Published	2024-09-25 16:25:45+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
