Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

Abstract

With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation systems is how to reduce inference latency and increase system throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings by proposing a combined set of modeling- and system-level acceleration and optimization strategies. At the model level, we dramatically reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and we design elastic inference scheduling and load-balancing mechanisms based on real-time load characteristics. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput, offering a practical solution for deploying large-scale online recommendation services.
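
The abstract names structured pruning and weight quantization without giving details, so the following is only a minimal, generic sketch of those two ideas and not the authors' method. The function names, the L2-norm pruning criterion, the symmetric per-row int8 scheme, and the sample weight matrix are all assumptions made for illustration.

```python
import math

def prune_rows(weights, keep_ratio):
    """Structured pruning sketch: keep the rows (output units) with the largest L2 norm."""
    n_keep = max(1, int(len(weights) * keep_ratio))
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    # Pick the n_keep largest-norm rows, then restore their original order.
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [weights[i] for i in kept], kept

def quantize_int8(row):
    """Symmetric int8 quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [scale * v for v in q]

# Hypothetical 4x3 weight matrix (illustrative values only).
weights = [
    [0.50, -1.20, 0.30],
    [0.01, 0.02, -0.01],
    [2.00, 0.10, -0.70],
    [-0.05, 0.03, 0.04],
]
pruned, kept = prune_rows(weights, keep_ratio=0.5)
quantized = [quantize_int8(row) for row in pruned]
restored = [dequantize(q, s) for q, s in quantized]
```

In this sketch, pruning removes whole rows, so the remaining matrix stays dense and smaller, and each restored weight differs from the original by at most half a quantization step (`scale / 2`).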

arXiv information

Authors	Junli Shao, Jing Dong, Dingzhou Wang, Kowei Shih, Dannier Li, Chengrui Zhou
Published	2025-06-17 17:08:47+00:00

Source and services used

arxiv.jp, Google
