‘Give Me BF16 or Give Me Death’? Accuracy-Performance Trade-Offs in LLM Quantization

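Summary

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear.
This paper conducts the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization on academic benchmarks and real-world tasks across the entire Llama-3.1 model family.
Across more than 500,000 evaluations, the study yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless at all model scales; (2) well-tuned INT8 (W8A8-INT) incurs surprisingly low (1-3%) accuracy degradation; and (3) INT4 weight-only quantization (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization.
The authors also investigate the optimal quantization format for different deployments by analyzing inference performance with the popular vLLM framework.
The analysis gives clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching.
For mixed workloads, the optimal choice depends on the specific use case.
These findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, balancing speed, efficiency, and accuracy.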

Abstract (original)

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale, ensuring the best balance between speed, efficiency, and accuracy.
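
As a rough illustration of what the bit widths in labels such as W8A8-INT and W4A16-INT imply for weights, the following minimal sketch applies plain symmetric round-to-nearest quantization to a handful of made-up values. It is an assumption-laden toy, not the calibration procedure used in the paper: the function names, the per-tensor scale, and the example weights are all hypothetical.

```python
def quantize_symmetric(values, bits):
    """Symmetric round-to-nearest quantization of floats to signed integer codes.

    Uses a single per-tensor scale (an illustrative assumption; real schemes
    often use per-channel or per-group scales).
    """
    qmax = 2 ** (bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_abs_error(values, bits):
    """Largest absolute reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(values, bits)
    restored = dequantize(codes, scale)
    return max(abs(v - r) for v, r in zip(values, restored))


# Made-up example weights, not taken from any model.
weights = [0.42, -1.30, 0.07, 0.88, -0.55]

codes_int4, scale_int4 = quantize_symmetric(weights, bits=4)
err_int8 = max_abs_error(weights, bits=8)
err_int4 = max_abs_error(weights, bits=4)

print("INT4 codes:", codes_int4)
print("max error at 8 bits:", err_int8)
print("max error at 4 bits:", err_int4)
```

With fewer bits the grid of representable values is coarser, so the round-trip error grows. That makes the abstract's finding that 4-bit weight-only quantization rivals 8-bit formats in accuracy notable.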

arXiv information

Authors	Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh
Published	2025-05-30 17:39:06+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
