Adaptive Gradient Prediction for DNN Training

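Summary

Neural network training is inherently sequential: the layers complete forward propagation one after another, and the gradients (based on a loss function) are then computed and back-propagated starting from the last layer.
These sequential computations significantly slow down neural network training, especially for deeper networks.
Prediction has been used successfully in many areas of computer architecture to speed up sequential processing.
The authors therefore propose ADA-GP, which uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy.
ADA-GP works by incorporating a small neural network that predicts gradients for the different layers of a DNN model.
ADA-GP uses a novel tensor reorganization to make it feasible to predict a large number of gradients.
ADA-GP alternates between DNN training with backpropagated gradients and DNN training with predicted gradients.
ADA-GP adaptively adjusts when and for how long gradient prediction is used, to balance accuracy and performance.
Finally, the authors provide a detailed hardware extension to a typical DNN accelerator to realize the speedup potential of gradient prediction.
Extensive experiments with fourteen DNN models show that ADA-GP achieves an average speedup of 1.47x with accuracy similar to or higher than that of the baseline models.
Moreover, it consumes 34% less energy on average than the baseline hardware accelerator, owing to reduced off-chip memory accesses.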
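
The alternation between backpropagation and gradient prediction described above can be sketched as a small scheduling loop. This is an illustration only, not the authors' algorithm: the summary does not say how ADA-GP decides when and for how long to predict, so the class `AdaptiveSchedule`, the function `run_training`, and the parameters `warmup_steps`, `max_gp_run`, and `tolerance` are all hypothetical. The assumed policy lengthens the run of predicted-gradient (GP) steps while the predictor's error on true gradients stays below a tolerance, and falls back to backpropagation (BP) only when it does not.

```python
class AdaptiveSchedule:
    """Hypothetical policy: choose backprop (BP) or gradient prediction (GP) per step."""

    def __init__(self, warmup_steps=2, max_gp_run=2, tolerance=0.1):
        self.warmup_steps = warmup_steps  # initial BP-only steps (assumed)
        self.max_gp_run = max_gp_run      # cap on consecutive GP steps (assumed)
        self.tolerance = tolerance        # predictor error treated as good enough (assumed)
        self.gp_run = 0                   # current GP run length, adapted online
        self._bp_seen = 0                 # number of BP steps reported so far
        self._gp_left = 0                 # GP steps remaining in the current run

    def next_phase(self):
        """Return "GP" while a prediction run is active, otherwise "BP"."""
        if self._gp_left > 0:
            self._gp_left -= 1
            return "GP"
        return "BP"

    def report_bp_error(self, error):
        """After a BP step, adapt the GP run length from the predictor's error."""
        self._bp_seen += 1
        if self._bp_seen <= self.warmup_steps:
            return  # warm-up: keep training the predictor on true gradients
        if error < self.tolerance:
            # Predictor looks accurate: trust it for a longer run, up to the cap.
            self.gp_run = min(self.gp_run + 1, self.max_gp_run)
        else:
            # Predictor is off: use backpropagation only until it recovers.
            self.gp_run = 0
        self._gp_left = self.gp_run


def run_training(schedule, bp_step, gp_step, total_steps):
    """Alternate BP and GP steps; bp_step() must return the predictor's error."""
    phases = []
    for _ in range(total_steps):
        phase = schedule.next_phase()
        if phase == "BP":
            schedule.report_bp_error(bp_step())
        else:
            gp_step()
        phases.append(phase)
    return phases


if __name__ == "__main__":
    # Stub steps: a real bp_step would backpropagate, update the model, and
    # train the predictor; a real gp_step would update the model from predictions.
    errors = iter([1.0, 0.5, 0.05, 0.05, 0.5])
    phases = run_training(AdaptiveSchedule(), lambda: next(errors), lambda: None, 8)
    print(phases)
```

In a real setup, `bp_step` and `gp_step` would wrap the framework's training step. Steps that use predicted gradients avoid waiting for the backward pass to reach each layer, which is where the summary locates the potential speedup.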

Abstract (original)

Neural network training is inherently sequential where the layers finish the forward propagation in succession, followed by the calculation and back-propagation of gradients (based on a loss function) starting from the last layer. The sequential computations significantly slow down neural network training, especially the deeper ones. Prediction has been successfully used in many areas of computer architecture to speed up sequential processing. Therefore, we propose ADA-GP, that uses gradient prediction adaptively to speed up deep neural network (DNN) training while maintaining accuracy. ADA-GP works by incorporating a small neural network to predict gradients for different layers of a DNN model. ADA-GP uses a novel tensor reorganization to make it feasible to predict a large number of gradients. ADA-GP alternates between DNN training using backpropagated gradients and DNN training using predicted gradients. ADA-GP adaptively adjusts when and for how long gradient prediction is used to strike a balance between accuracy and performance. Last but not least, we provide a detailed hardware extension in a typical DNN accelerator to realize the speed up potential from gradient prediction. Our extensive experiments with fourteen DNN models show that ADA-GP can achieve an average speed up of 1.47x with similar or even higher accuracy than the baseline models. Moreover, it consumes, on average, 34% less energy due to reduced off-chip memory accesses compared to the baseline hardware accelerator.

arXiv information

Authors	Vahid Janfaza, Shantanu Mandal, Farabi Mahmud, Abdullah Muzahid
Published	2023-05-22 17:10:44+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
