Hierarchical Autoscaling for Large Language Model Serving with Chiron

Summary

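Serving large language models (LLMs) is becoming an increasingly important workload for cloud providers.
Based on their performance SLO requirements, LLM inference requests can be divided into (a) interactive requests with tight SLOs on the order of seconds and (b) batch requests with relaxed SLOs on the order of minutes to hours.
Because SLO attainment can degrade depending on arrival rates, multiplexing, and configuration parameters, resource autoscaling must be applied both to serving instances and to their batch sizes.
However, previous autoscalers for LLM serving do not take request SLOs into account, which leads to unnecessary scaling and resource under-utilization.
To address these limitations, we introduce Chiron, an autoscaler built on the idea of hierarchical backpressure, which is estimated from queue size, utilization, and SLOs.
Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared with existing solutions.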
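
The abstract does not spell out how the backpressure signal is computed, so the following is only a toy sketch of what an SLO-aware, two-level scaling decision might look like. Everything in it is an assumption made for illustration and is not taken from the paper: the `InstanceStats` fields, the drain-time formula, the 0.9 / 0.3 / 0.1 thresholds, and the batch-doubling rule are all hypothetical. The local level adjusts the batch size of one instance; the global level changes the instance count only when local tuning has no headroom left.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InstanceStats:
    """Hypothetical per-instance measurements (not Chiron's actual data model)."""
    queue_size: int        # requests waiting at this instance
    utilization: float     # fraction of capacity in use, 0.0 to 1.0
    service_rate: float    # requests completed per second (measured)
    slo_s: float           # latency SLO of the request class, in seconds


def local_backpressure(s: InstanceStats) -> float:
    """Toy signal: estimated queue drain time divided by the SLO.

    Values above 1.0 suggest that queued requests may miss their SLO.
    """
    drain_time_s = s.queue_size / max(s.service_rate, 1e-9)  # avoid division by zero
    return drain_time_s / s.slo_s


def next_batch_size(s: InstanceStats, batch_size: int, max_batch: int = 64) -> int:
    """Local level: grow the batch while the SLO is at risk and capacity remains."""
    if local_backpressure(s) > 1.0 and s.utilization < 0.9:
        return min(batch_size * 2, max_batch)
    return batch_size


def desired_instances(stats: List[InstanceStats]) -> int:
    """Global level: add an instance when some instance is both at risk and
    saturated; remove one when every instance is nearly idle."""
    n = len(stats)
    if any(local_backpressure(s) > 1.0 and s.utilization >= 0.9 for s in stats):
        return n + 1
    if n > 1 and all(local_backpressure(s) < 0.1 and s.utilization < 0.3 for s in stats):
        return n - 1
    return n


# Same load, different SLOs: an interactive class (5 s) and a batch class (1 h).
interactive = InstanceStats(queue_size=40, utilization=0.95, service_rate=4.0, slo_s=5.0)
batch = InstanceStats(queue_size=40, utilization=0.95, service_rate=4.0, slo_s=3600.0)
print(desired_instances([interactive]), desired_instances([batch]))
```

Under these assumed thresholds, the identical queue and utilization trigger a scale-up for the interactive class but not for the batch class, which is the kind of unnecessary scaling an SLO-unaware autoscaler would not be able to avoid.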

Summary (original)

Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Based on performance SLO requirements, LLM inference requests can be divided into (a) interactive requests that have tight SLOs in the order of seconds, and (b) batch requests that have relaxed SLO in the order of minutes to hours. These SLOs can degrade based on the arrival rates, multiplexing, and configuration parameters, thus necessitating the use of resource autoscaling on serving instances and their batch sizes. However, previous autoscalers for LLM serving do not consider request SLOs leading to unnecessary scaling and resource under-utilization. To address these limitations, we introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs. Our experiments show that Chiron achieves up to 90% higher SLO attainment and improves GPU efficiency by up to 70% compared to existing solutions.

arXiv information

Authors	Archit Patke, Dhemath Reddy, Saurabh Jha, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer
Published	2025-01-14 12:57:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
