Neural Network Compression using Binarization and Few Full-Precision Weights

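Summary

Quantization and pruning are two effective compression methods for deep neural network models.
This paper proposes Automatic Prune Binarization (APB), a new compression technique that combines quantization with pruning.
APB strengthens the representational capability of binary networks by using a few full-precision weights.
The technique decides, for each weight, whether to binarize it or keep it in full precision, jointly maximizing network accuracy and minimizing memory impact.
The authors show how to run the forward pass efficiently through layers compressed with APB by decomposing it into a binary matrix multiplication and a sparse-dense matrix multiplication.
They also design two new efficient algorithms for extremely quantized matrix multiplication on CPU that rely on highly efficient bitwise operations.
The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions.
APB is evaluated extensively on two widely adopted model compression datasets, CIFAR10 and ImageNet.
APB achieves a better accuracy/memory trade-off than state-of-the-art methods based on i) quantization, ii) pruning, and iii) the combination of pruning and quantization.
APB also outperforms quantization in the accuracy/efficiency trade-off, running up to 2x faster than the 2-bit quantized model with no loss in accuracy.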
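
The decomposition of the forward pass into a binary product and a sparse-dense product can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical selection rule (weights with magnitude above `threshold` stay in full precision, all others become +alpha or -alpha), and the names `compress`, `forward`, `alpha`, and `threshold` are introduced here only for illustration. Under that assumption the compressed layer can be written as W = alpha * B + S, with B a sign matrix and S a sparse correction.

```python
def compress(weights, alpha, threshold):
    """Split a weight matrix into a sign matrix B and a sparse correction S.

    Assumed rule (hypothetical, for illustration only): weights with
    |w| > threshold stay in full precision, the rest become +/- alpha.
    """
    signs = [[1 if w >= 0 else -1 for w in row] for row in weights]
    sparse = {}  # (row, col) -> value added on top of alpha * sign
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            if abs(w) > threshold:
                # Store the correction so that alpha * sign + correction == w.
                sparse[(i, j)] = w - alpha * signs[i][j]
    return signs, sparse


def forward(signs, sparse, alpha, x):
    """Compute y = alpha * (B @ x) + S @ x for an input vector x."""
    # Binary part: every weight contributes +alpha or -alpha.
    y = [alpha * sum(b * v for b, v in zip(row, x)) for row in signs]
    # Sparse part: only the few full-precision positions are visited.
    for (i, j), correction in sparse.items():
        y[i] += correction * x[j]
    return y
```

Calling `compress` once and then `forward` for each input gives the same result as multiplying by the dense matrix in which the selected weights keep their original values and every other weight is replaced by +alpha or -alpha.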

Summary (original)

Quantization and pruning are two effective Deep Neural Networks model compression methods. In this paper, we propose Automatic Prune Binarization (APB), a novel compression technique combining quantization with pruning. APB enhances the representational capability of binary networks using a few full-precision weights. Our technique jointly maximizes the accuracy of the network while minimizing its memory impact by deciding whether each weight should be binarized or kept in full precision. We show how to efficiently perform a forward pass through layers compressed using APB by decomposing it into a binary and a sparse-dense matrix multiplication. Moreover, we design two novel efficient algorithms for extremely quantized matrix multiplication on CPU, leveraging highly efficient bitwise operations. The proposed algorithms are 6.9x and 1.5x faster than available state-of-the-art solutions. We extensively evaluate APB on two widely adopted model compression datasets, namely CIFAR10 and ImageNet. APB delivers better accuracy/memory trade-off compared to state-of-the-art methods based on i) quantization, ii) pruning, and iii) combination of pruning and quantization. APB outperforms quantization in the accuracy/efficiency trade-off, being up to 2x faster than the 2-bit quantized model with no loss in accuracy.
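
The abstract mentions bitwise operations for extremely quantized matrix multiplication without giving details. As a rough illustration of why bitwise operations help, the sketch below shows the widely used XOR-and-popcount form of a dot product between two {-1, +1} vectors. This is the generic trick, not either of the two algorithms proposed in the paper, and the helper names `pack` and `binary_dot` are hypothetical.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int; bit k is set when values[k] == +1."""
    bits = 0
    for k, v in enumerate(values):
        if v == 1:
            bits |= 1 << k
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n {-1, +1} vectors given their packed bits."""
    # Positions where the bits differ contribute -1, matching positions +1,
    # so the dot product is (n - mismatches) - mismatches.
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches
```

For example, `binary_dot(pack([1, -1, 1, 1]), pack([1, 1, -1, 1]), 4)` handles all four positions with a single XOR and one population count instead of four multiplications.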

arXiv information

Authors	Franco Maria Nardini, Cosimo Rulli, Salvatore Trani, Rossano Venturini
Published	2023-09-15 12:13:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
