Queue Management for SLO-Oriented Large Language Model Serving

Summary

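Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers.
Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, which have tight latency SLO requirements.
However, when such systems run batch requests with relaxed SLOs alongside interactive requests, the result is poor multiplexing and inefficient resource utilization.
To address these challenges, we propose QLM, a queue management system for LLM serving.
QLM maintains batch and interactive requests across different models and SLOs in a request queue.
Optimal ordering of the request queue is critical for maintaining SLOs while ensuring high resource utilization.
To generate this ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times of requests in the request queue.
A global scheduler uses these estimates to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping.
Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared with other state-of-the-art LLM serving systems.
QLM's evaluation is based on the production requirements of a cloud provider.
QLM is publicly available at https://www.github.com/QLM-project/QLM.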

Summary (original)

Large language model (LLM) serving is becoming an increasingly critical workload for cloud providers. Existing LLM serving systems focus on interactive requests, such as chatbots and coding assistants, with tight latency SLO requirements. However, when such systems execute batch requests that have relaxed SLOs along with interactive requests, it leads to poor multiplexing and inefficient resource utilization. To address these challenges, we propose QLM, a queue management system for LLM serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. Optimal ordering of the request queue is critical to maintain SLOs while ensuring high resource utilization. To generate this optimal ordering, QLM uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue. These estimates are used by a global scheduler to orchestrate LLM Serving Operations (LSOs) such as request pulling, request eviction, load balancing, and model swapping. Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems. QLM’s evaluation is based on the production requirements of a cloud provider. QLM is publicly available at https://www.github.com/QLM-project/QLM.
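
The abstract does not say how the RWT Estimator or the global scheduler work internally, so the following is only a minimal sketch of the general idea that queue order determines waiting time and therefore SLO attainment. It is not QLM's algorithm. Everything in it is an assumption made for illustration: the `Request` fields, the constant token throughput, the known output lengths, and the use of earliest-deadline-first ordering in place of whatever ordering QLM actually computes.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    slo_seconds: float  # latency budget measured from now (hypothetical field)
    est_tokens: int     # assumed output length; a real system would have to predict this


def estimate_waiting_times(queue, tokens_per_second):
    """Waiting time of each request = time needed to drain everything ahead of it."""
    waits = []
    tokens_ahead = 0
    for req in queue:
        waits.append(tokens_ahead / tokens_per_second)
        tokens_ahead += req.est_tokens
    return waits


def order_by_deadline(queue):
    """Earliest-deadline-first ordering; sorted() is stable, so ties keep arrival order."""
    return sorted(queue, key=lambda r: r.slo_seconds)


def slo_violations(queue, tokens_per_second):
    """Names of requests whose estimated wait already exceeds their SLO."""
    waits = estimate_waiting_times(queue, tokens_per_second)
    return [r.name for r, w in zip(queue, waits) if w > r.slo_seconds]


# Two batch requests with relaxed SLOs interleaved with two interactive ones.
arrival_order = [
    Request("batch-1", slo_seconds=600.0, est_tokens=4000),
    Request("chat-1", slo_seconds=5.0, est_tokens=200),
    Request("batch-2", slo_seconds=900.0, est_tokens=6000),
    Request("chat-2", slo_seconds=10.0, est_tokens=300),
]
throughput = 100.0  # tokens per second, an assumed constant

fifo_violations = slo_violations(arrival_order, throughput)
edf_queue = order_by_deadline(arrival_order)
edf_violations = slo_violations(edf_queue, throughput)

print("FIFO violations:", fifo_violations)
print("EDF violations: ", edf_violations)
```

With these made-up numbers, serving in arrival order leaves both interactive requests waiting behind a long batch request, while reordering by deadline lets every request meet its SLO. A real scheduler would also have to account for multiple models, heterogeneous devices, and uncertain output lengths, none of which this sketch models.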

arXiv information

Authors	Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer
Published	2025-02-25 17:54:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
