Exploiting Student Parallelism for Low-latency GPU Inference of BERT-like Models in Online Services

Abstract

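BERT-like models are widely adopted for discriminative text mining and web search because of their high accuracy.
However, large BERT-like models suffer from inefficient online inference on GPUs, for two reasons.
First, they rely on large model depth to reach high accuracy, and sequential computation on the GPU grows linearly with depth.
Second, stochastic and dynamic online workloads incur extra costs.
This paper presents Academus, a system for low-latency online inference of BERT-like models.
At the core of Academus is a novel technique, student parallelism, which uses boosting ensemble and stacking distillation to distill the original deep model into an equivalent group of parallel, shallow student models.
This lets Academus reach a lower model depth than the baselines (e.g., two layers) and, as a result, the lowest inference latency without affecting accuracy. For occasional workload bursts, it can temporarily reduce the number of students to improve throughput with minimal accuracy loss.
It also employs system designs specialized for student parallelism to better handle stochastic online workloads.
The authors run comprehensive experiments to verify its effectiveness.
The results show that Academus outperforms the baselines by 1.6x to 4.1x in latency without compromising accuracy, and achieves up to 22.27x higher throughput during workload bursts.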

Abstract (original)

Due to high accuracy, BERT-like models have been widely adopted by discriminative text mining and web searching. However, large BERT-like models suffer from inefficient online inference, as they face the following two problems on GPUs. First, they rely on the large model depth to achieve high accuracy, which linearly increases the sequential computation on GPUs. Second, stochastic and dynamic online workloads cause extra costs. In this paper, we present Academus for low-latency online inference of BERT-like models. At the core of Academus is the novel student parallelism, which adopts boosting ensemble and stacking distillation to distill the original deep model into an equivalent group of parallel and shallow student models. This enables Academus to achieve the lower model depth (e.g., two layers) than baselines and consequently the lowest inference latency without affecting the accuracy. For occasional workload bursts, it can temporarily decrease the number of students with minimal accuracy loss to improve throughput. Additionally, it employs specialized system designs for student parallelism to better handle stochastic online workloads. We conduct comprehensive experiments to verify the effectiveness. The results show that Academus outperforms the baselines by 4.1X~1.6X in latency without compromising accuracy, and achieves up to 22.27X higher throughput for workload bursts.
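
The abstract describes the idea only at a high level, so the following is a toy sketch of the general boosting pattern it names, not the paper's implementation. It assumes each "student" can be stood in for by a one-split regression stump fitted to the residual left by the earlier students, and that a thread pool can stand in for running students side by side on a GPU. Stacking distillation is not shown. All function names (`fit_stump`, `train_students`, `ensemble_predict`, `mean_squared_error`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_stump(xs, residuals):
    """Fit a one-split regression stump: a toy stand-in for one shallow student."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else 0.0
        error = sum(
            (r - (left_mean if x <= threshold else right_mean)) ** 2
            for x, r in zip(xs, residuals)
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def train_students(xs, teacher_outputs, num_students):
    """Boosting: each new student fits what the earlier students still get wrong."""
    students = []
    residuals = list(teacher_outputs)
    for _ in range(num_students):
        student = fit_stump(xs, residuals)
        students.append(student)
        residuals = [r - student(x) for x, r in zip(xs, residuals)]
    return students


def ensemble_predict(students, x, active=None):
    """Sum the outputs of the first `active` students, evaluated concurrently.

    Passing a smaller `active` mimics dropping students during a workload
    burst: less work per request at the cost of a coarser prediction.
    """
    chosen = students if active is None else students[:active]
    with ThreadPoolExecutor(max_workers=max(1, len(chosen))) as pool:
        return sum(pool.map(lambda student: student(x), chosen))


def mean_squared_error(students, xs, targets, active=None):
    """Average squared gap between the ensemble and the teacher outputs."""
    errors = [
        (ensemble_predict(students, x, active) - t) ** 2
        for x, t in zip(xs, targets)
    ]
    return sum(errors) / len(errors)
```

Because every student is independent at inference time, the depth of the sequential computation is that of a single student rather than of the whole ensemble, which is the property the abstract credits for the latency reduction. In this toy setting, later students only correct residuals, so truncating the list degrades the fit gradually instead of breaking it.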

arXiv information

Authors	Weiyan Wang, Yilun Jin, Yiming Zhang, Victor Junqiu Wei, Han Tian, Li Chen, Kai Chen
Published	2024-08-22 16:31:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
