Channel-Wise Mixed-Precision Quantization for Large Language Models

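Summary

Large language models (LLMs) have achieved remarkable success across a wide range of language tasks, but their large parameter counts make deployment on edge devices difficult. Weight-only quantization is a promising way to reduce the memory footprint of LLMs. Existing approaches, however, focus mainly on integer-bit quantization, which limits their adaptability to fractional-bit quantization tasks and leaves some of the storage available on a device unused. This paper introduces Channel-Wise Mixed-Precision Quantization (CMPQ), a new mixed-precision quantization method that assigns quantization precision per channel based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ adopts a non-uniform quantization strategy and incorporates two outlier extraction techniques that work together to preserve critical information and minimize quantization loss. Experiments on LLMs of various sizes show that CMPQ not only improves performance on integer-bit quantization tasks but also achieves substantial performance gains for a modest increase in memory usage. CMPQ is thus an adaptive and effective approach to LLM quantization that offers significant benefits across devices with differing capabilities.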

Abstract (original)

Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter sizes. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing the full utilization of available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve the critical information, thereby minimizing the quantization loss. Experiments on different sizes of LLMs demonstrate that CMPQ not only enhances performance in integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
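To make the idea of a fractional average bit-width concrete, here is a minimal sketch of how mixing two integer precisions across channels can hit a non-integer budget such as 3.25 bits. It is not the paper's algorithm: the abstract does not specify the allocation rule, so the ranking statistic (a per-channel activation scale), the function name `allocate_channel_bits`, and the two-level 3-bit/4-bit scheme are all assumptions made for illustration. The non-uniform quantizer and the outlier extraction steps are omitted.

```python
import math


def allocate_channel_bits(act_scale, target_avg_bits, low_bits=3, high_bits=4):
    """Hypothetical allocation rule (an assumption, not taken from the paper).

    Gives `high_bits` to the channels with the largest activation scale and
    `low_bits` to the rest, so the mean bit-width stays within the budget.
    """
    n = len(act_scale)
    # Fraction of channels that can afford the higher precision.
    frac_high = (target_avg_bits - low_bits) / (high_bits - low_bits)
    # Round down so the budget is never exceeded; the epsilon guards
    # against floating-point error when frac_high * n is a whole number.
    n_high = max(0, min(n, math.floor(frac_high * n + 1e-9)))
    # Rank channels by activation scale, largest first.
    order = sorted(range(n), key=lambda c: act_scale[c], reverse=True)
    bits = [low_bits] * n
    for c in order[:n_high]:
        bits[c] = high_bits
    return bits


# Illustrative per-channel activation scales (made-up numbers).
scales = [0.1, 2.0, 0.5, 3.0, 0.2, 0.3, 0.4, 0.05]
bits = allocate_channel_bits(scales, target_avg_bits=3.25)
print(bits)                   # [3, 4, 3, 4, 3, 3, 3, 3]
print(sum(bits) / len(bits))  # 3.25
```

With eight channels and a 3.25-bit budget, two channels receive 4 bits and six receive 3 bits, which averages to exactly 3.25. A uniform integer-bit scheme would have to stay at 3 bits and leave the remaining storage unused.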

arXiv information

Authors	Zihan Chen, Bike Xie, Jundong Li, Cong Shen
Published	2024-11-01 03:16:30+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
