BlockDialect: Block-wise Fine-grained Mixed Format Quantization for Energy-Efficient LLM Inference

Summary

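The rapid growth in the size of large language models (LLMs) poses significant challenges for memory usage and computational cost.
Quantizing both weights and activations can address these problems, and hardware-supported fine-grained scaling is emerging as a promising way to mitigate outliers.
However, existing methods struggle to capture the nuanced data distribution within each block.
We propose BlockDialect, a block-wise fine-grained mixed-format technique that assigns each block its optimal number format from a formatbook for better data representation.
We also introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions.
To use it efficiently, we propose a two-stage approach for online DialectFP4 activation quantization.
Importantly, DialectFP4 ensures energy efficiency by choosing representable values that can be expressed as scaled integers compatible with low-precision integer arithmetic.
Compared with the MXFP4 format, BlockDialect achieves a 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model while using fewer bits per data element, and it is only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication.
By focusing on how to represent data rather than how to scale it, our work presents a promising path toward energy-efficient LLM inference.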

Abstract (original)

The rapidly increasing size of large language models (LLMs) presents significant challenges in memory usage and computational costs. Quantizing both weights and activations can address these issues, with hardware-supported fine-grained scaling emerging as a promising solution to mitigate outliers. However, existing methods struggle to capture nuanced block data distributions. We propose BlockDialect, a block-wise fine-grained mixed format technique that assigns a per-block optimal number format from a formatbook for better data representation. Additionally, we introduce DialectFP4, a formatbook of FP4 variants (akin to dialects) that adapt to diverse data distributions. To leverage this efficiently, we propose a two-stage approach for online DialectFP4 activation quantization. Importantly, DialectFP4 ensures energy efficiency by selecting representable values as scaled integers compatible with low-precision integer arithmetic. BlockDialect achieves 10.78% (7.48%) accuracy gain on the LLaMA3-8B (LLaMA2-7B) model compared to MXFP4 format with lower bit usage per data, while being only 5.45% (2.69%) below full precision even when quantizing full-path matrix multiplication. Focusing on how to represent over how to scale, our work presents a promising path for energy-efficient LLM inference.
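
The per-block format assignment described in the abstract can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the `FORMATBOOK` entries other than the standard FP4 (E2M1) magnitudes are invented value sets and are not the DialectFP4 tables from the paper, the mean-squared-error selection rule is one plausible criterion that the abstract does not specify, and the per-block scale is a plain floating-point number rather than a hardware-friendly shared exponent. The two-stage online procedure for activations is not reproduced here.

```python
# Hypothetical formatbook. Each "dialect" lists its non-negative magnitudes as
# integers in units of 0.5, so dequantized values are scaled integers.
# Only "standard_fp4" matches a real format (E2M1 magnitudes 0, 0.5, 1, 1.5,
# 2, 3, 4, 6); the other two entries are illustrative assumptions.
FORMATBOOK = {
    "standard_fp4": [0, 1, 2, 3, 4, 6, 8, 12],
    "dense_top": [0, 1, 2, 3, 4, 6, 10, 12],
    "dense_mid": [0, 1, 2, 4, 5, 6, 8, 12],
}


def quantize_block(block, levels):
    """Map each value to the nearest signed level; return (codes, scale)."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return [0] * len(block), 1.0
    # Assumption: a float scale that maps the block maximum to the top level.
    scale = max_abs / levels[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        mag = min(levels, key=lambda q: abs(target - q))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale


def block_error(block, codes, scale):
    """Mean squared error between the block and its dequantized codes."""
    return sum((x - c * scale) ** 2 for x, c in zip(block, codes)) / len(block)


def choose_dialect(block, formatbook=FORMATBOOK):
    """Try every dialect and keep the one with the lowest error for this block."""
    best = None
    for name, levels in formatbook.items():
        codes, scale = quantize_block(block, levels)
        err = block_error(block, codes, scale)
        if best is None or err < best[3]:
            best = (name, codes, scale, err)
    return best


def quantize_tensor(values, block_size=32):
    """Split a flat list into blocks and pick a dialect for each block independently."""
    return [
        choose_dialect(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]
```

Because the codes are small integers, a dot product between two quantized blocks could accumulate `code_a * code_b` in integer arithmetic and apply the two scales once at the end, which is the property the abstract ties to energy efficiency.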

arXiv information

Authors	Wonsuk Jang, Thierry Tambe
Published	2025-01-21 07:34:54+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
