Efficient Post-training Quantization with FP8 Formats

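Summary

Recent advances in deep learning methods such as LLMs and diffusion models have created a need for better quantization methods that can meet the computational demands of these modern architectures while preserving accuracy.
Toward this goal, the authors study the benefits of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, image generation, and segmentation.
They examine three FP8 representations (E5M2, E4M3, and E3M4) to study how different trade-offs between dynamic range and precision affect model accuracy.
Based on this extensive study, they developed a quantization workflow that generalizes across different network architectures.
The empirical results show that FP8 formats outperform INT8 in several respects, including workload coverage (92.64% vs. 65.87%), model accuracy, and suitability for a broader range of operations.
The findings further suggest that E4M3 is better suited to NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
The code is publicly available in Intel Neural Compressor (https://github.com/intel/neural-compressor).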
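
To make the trade-off between dynamic range and precision concrete, the following minimal Python sketch derives the representable range and relative step size of each format from its exponent and mantissa bit counts. It rests on assumptions that the summary does not state: the bias is taken to be 2^(e-1) - 1, E5M2 is assumed to reserve its top exponent code for infinity and NaN in the IEEE style, and E4M3 and E3M4 are assumed to reserve only the all-ones bit pattern for NaN. The class name `FP8Format` and its fields are hypothetical and are not part of Intel Neural Compressor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FP8Format:
    """Hypothetical descriptor of an 8-bit float: 1 sign bit + exp_bits + man_bits."""
    name: str
    exp_bits: int
    man_bits: int
    ieee_like: bool  # assumption: True means the top exponent code is inf/NaN only

    @property
    def bias(self) -> int:
        # Assumed exponent bias, following the usual IEEE convention.
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_normal(self) -> float:
        top_code = 2 ** self.exp_bits - 1
        if self.ieee_like:
            # Largest usable exponent code is top_code - 1; mantissa may be all ones.
            exponent = top_code - 1 - self.bias
            significand = 2 - 2 ** -self.man_bits
        else:
            # Top exponent code is usable, but the all-ones mantissa is kept for NaN.
            exponent = top_code - self.bias
            significand = 2 - 2 * 2 ** -self.man_bits
        return significand * 2.0 ** exponent

    @property
    def min_normal(self) -> float:
        return 2.0 ** (1 - self.bias)

    @property
    def min_subnormal(self) -> float:
        return 2.0 ** (1 - self.bias - self.man_bits)

    @property
    def epsilon(self) -> float:
        # Relative spacing between neighbouring normal values.
        return 2.0 ** -self.man_bits


FORMATS = {
    "E5M2": FP8Format("E5M2", exp_bits=5, man_bits=2, ieee_like=True),
    "E4M3": FP8Format("E4M3", exp_bits=4, man_bits=3, ieee_like=False),
    "E3M4": FP8Format("E3M4", exp_bits=3, man_bits=4, ieee_like=False),
}

for fmt in FORMATS.values():
    print(f"{fmt.name}: max={fmt.max_normal:g}  min_normal={fmt.min_normal:g}  "
          f"min_subnormal={fmt.min_subnormal:g}  epsilon={fmt.epsilon:g}")
```

Under these assumptions, each mantissa bit moved to the exponent roughly squares the representable range while doubling the relative rounding step, which is the trade-off the three formats span.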

Summary (original)

Recent advances in deep learning methods such as LLMs and Diffusion models have created a need for improved quantization methods that can meet the computational demands of these modern architectures while maintaining accuracy. Towards this goal, we study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures covering a wide range of tasks, including machine translation, language modeling, text generation, image classification, generation, and segmentation. We examine three different FP8 representations (E5M2, E4M3, and E3M4) to study the effects of varying degrees of trade-off between dynamic range and precision on model accuracy. Based on our extensive study, we developed a quantization workflow that generalizes across different network architectures. Our empirical results show that FP8 formats outperform INT8 in multiple aspects, including workload coverage (92.64% vs. 65.87%), model accuracy and suitability for a broader range of operations. Furthermore, our findings suggest that E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks. The code is publicly available on Intel Neural Compressor: https://github.com/intel/neural-compressor.
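
As an illustration of how the choice of format affects individual values, the sketch below rounds a float to the nearest point on an FP8-like grid and saturates at the format's largest value. It is a simplified simulation and not the authors' quantization workflow or the Intel Neural Compressor API. The function name `fake_quant_fp8`, the `ASSUMED_FORMATS` table, and the maximum values in it (57344, 448, 30) are assumptions, as is the IEEE-style bias of 2^(e-1) - 1. Only finite inputs are handled.

```python
import math


def fake_quant_fp8(x: float, exp_bits: int, man_bits: int, max_value: float) -> float:
    """Round a finite float to the nearest value on an FP8-like grid (hypothetical helper)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1       # assumed IEEE-style exponent bias
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), max_value)           # saturate instead of overflowing
    _, ex = math.frexp(a)                # a = m * 2**ex with 0.5 <= m < 1
    e = max(ex - 1, 1 - bias)            # below the smallest normal, use the subnormal spacing
    step = 2.0 ** (e - man_bits)         # distance between neighbouring grid points near a
    q = round(a / step) * step           # Python's round() rounds halves to even
    return sign * min(q, max_value)


# Assumed (exp_bits, man_bits, max_value) per format; not taken from the abstract.
ASSUMED_FORMATS = {
    "E5M2": (5, 2, 57344.0),
    "E4M3": (4, 3, 448.0),
    "E3M4": (3, 4, 30.0),
}

samples = [0.3, -1.7, 12.5, 100.0]
for name, (e_bits, m_bits, max_val) in ASSUMED_FORMATS.items():
    rounded = [fake_quant_fp8(v, e_bits, m_bits, max_val) for v in samples]
    print(name, rounded)
```

With these assumed parameters, 0.3 rounds to 0.3125 in E5M2 but to 0.296875 in E3M4 (finer precision), while 100.0 rounds to 96.0 in E5M2 but saturates at 30.0 in E3M4 (narrower range). In practice a scaling factor would normally be applied before rounding so that a tensor's values fit the chosen format's range.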

arXiv information

Authors	Haihao Shen,Naveen Mellempudi,Xin He,Qun Gao,Chang Wang,Mengni Wang
Published	2024-03-31 23:05:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
