Ascendra: Dynamic Request Prioritization for Efficient LLM Serving

Abstract

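The rapid advancement of large language models (LLMs) has created a need for more efficient serving strategies.
In this context, efficiency refers to the proportion of requests that meet their service level objectives (SLOs), particularly for time to first token (TTFT) and time between tokens (TBT).
However, existing systems often prioritize one of these metrics at the expense of the other.
We present Ascendra, an LLM serving system designed to meet both TTFT and TBT SLOs simultaneously.
The core insight behind Ascendra is that a request's urgency evolves as it approaches its deadline.
To exploit this, Ascendra partitions GPU resources into two types of instances: low-priority and high-priority.
Low-priority instances maximize throughput by processing requests out of arrival order, at the risk of request starvation.
To address this, Ascendra employs a performance model to predict which requests are at risk of missing their SLOs and proactively offloads them to high-priority instances.
High-priority instances are optimized for low-latency execution and handle urgent requests that are nearing their deadlines.
This partitioned architecture lets Ascendra balance high throughput against low latency.
Extensive evaluation shows that Ascendra improves system throughput by up to 1.7x compared to vLLM and Sarathi-Serve while meeting both TTFT and TBT SLOs.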
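
The efficiency metric described in the abstract, the proportion of requests that meet both their TTFT and TBT SLOs, can be sketched as follows. This is an illustrative sketch only, not the paper's evaluation code: the function names `meets_slos` and `slo_attainment` are hypothetical, and it assumes a request meets its TBT SLO only if every gap between consecutive tokens is within the limit. The paper may aggregate TBT differently (for example, with a percentile).

```python
def meets_slos(arrival_s, token_times_s, ttft_slo_s, tbt_slo_s):
    """Return True if one request met both SLOs.

    arrival_s      -- time the request arrived, in seconds
    token_times_s  -- emission time of each output token, in seconds
    Assumption: the TBT SLO must hold for every inter-token gap.
    """
    if not token_times_s:
        return False  # no token was ever produced, so TTFT is missed
    ttft = token_times_s[0] - arrival_s
    gaps = [later - earlier
            for earlier, later in zip(token_times_s, token_times_s[1:])]
    return ttft <= ttft_slo_s and all(gap <= tbt_slo_s for gap in gaps)


def slo_attainment(requests, ttft_slo_s, tbt_slo_s):
    """Fraction of requests meeting both SLOs.

    requests -- list of (arrival_s, token_times_s) pairs
    """
    if not requests:
        return 0.0
    met = sum(meets_slos(arrival, times, ttft_slo_s, tbt_slo_s)
              for arrival, times in requests)
    return met / len(requests)
```

For example, with a 1.0 s TTFT SLO and a 0.2 s TBT SLO, a request whose first token arrives after 2.0 s counts as a miss even if its later tokens stream quickly, which is why optimizing only one of the two metrics lowers the overall score.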

Abstract (original)

The rapid advancement of Large Language Models (LLMs) has driven the need for more efficient serving strategies. In this context, efficiency refers to the proportion of requests that meet their Service Level Objectives (SLOs), particularly for Time To First Token (TTFT) and Time Between Tokens (TBT). However, existing systems often prioritize one metric at the cost of the other. We present Ascendra, an LLM serving system designed to meet both TTFT and TBT SLOs simultaneously. The core insight behind Ascendra is that a request’s urgency evolves as it approaches its deadline. To leverage this, Ascendra partitions GPU resources into two types of instances: low-priority and high-priority. Low-priority instances maximize throughput by processing requests out of arrival order, but at the risk of request starvation. To address this, Ascendra employs a performance model to predict requests at risk of missing their SLOs and proactively offloads them to high-priority instances. High-priority instances are optimized for low-latency execution and handle urgent requests nearing their deadlines. This partitioned architecture enables Ascendra to effectively balance high throughput and low latency. Extensive evaluation shows that Ascendra improves system throughput by up to 1.7x compared to vLLM and Sarathi-Serve while meeting both TTFT and TBT SLOs.
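
The idea that urgency grows as a request nears its deadline, and that a performance model decides when to offload, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Request` fields, the linear prefill-time model and its coefficients, and the `margin_s` threshold are hypothetical and are not taken from the paper, which does not describe its performance model in the abstract.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    arrival_s: float    # arrival time, in seconds
    prompt_tokens: int  # prompt length, which drives prefill cost


def predicted_prefill_s(req, per_token_s=0.0005, fixed_s=0.02):
    """Hypothetical linear performance model for prefill latency."""
    return fixed_s + per_token_s * req.prompt_tokens


def ttft_slack_s(req, now_s, ttft_slo_s):
    """Time left before the TTFT deadline after the predicted prefill."""
    deadline_s = req.arrival_s + ttft_slo_s
    return deadline_s - now_s - predicted_prefill_s(req)


def split_by_urgency(queue, now_s, ttft_slo_s, margin_s=0.1):
    """Split a low-priority queue into (keep, offload).

    A request is offloaded to a high-priority instance once its
    predicted slack falls below margin_s (an assumed safety margin).
    """
    keep, offload = [], []
    for req in queue:
        if ttft_slack_s(req, now_s, ttft_slo_s) < margin_s:
            offload.append(req)
        else:
            keep.append(req)
    return keep, offload


queue = [
    Request("a", arrival_s=0.0, prompt_tokens=200),
    Request("b", arrival_s=0.8, prompt_tokens=200),
    Request("c", arrival_s=0.9, prompt_tokens=1500),
]
keep, offload = split_by_urgency(queue, now_s=1.0, ttft_slo_s=1.2)
print([r.req_id for r in keep], [r.req_id for r in offload])
```

In this example request `a` has waited longest and has the least slack, so it is the one handed to the high-priority side, while `b` and `c` stay where throughput-oriented, out-of-order scheduling can still serve them in time.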

arXiv information

Authors	Azam Ikram, Xiang Li, Sameh Elnikety, Saurabh Bagchi
Published	2025-04-30 14:08:38+00:00

Source and services used

arxiv.jp, Google
