ECCOS: Efficient Capability and Cost Coordinated Scheduling for Multi-LLM Serving

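Summary

As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges.
Existing scheduling frameworks mainly target latency optimization and neglect the capability of LLMs to serve different levels of queries, which can waste computational resources.
This paper addresses the challenge by proposing ECCOS, a capability-cost coordinated scheduling framework for multi-LLM serving.
Specifically, it introduces two-stage scheduling by designing a multi-objective predictor and a constrained optimizer.
The predictor estimates both model capability and computational cost through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints.
It also introduces QAServe, a dataset collected for sample-wise response quality and cost.
Extensive experiments show that ECCOS improves success rates by 6.30% and reduces costs by 10.15% compared with existing methods, while consuming less than 0.5% of LLM response time.
The code is available at https://github.com/agiresearch/ECCOS.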

Summary (original)

As large language models (LLMs) are increasingly deployed as service endpoints in systems, the surge in query volume creates significant scheduling challenges. Existing scheduling frameworks mainly target latency optimization while neglecting the capability of LLMs to serve different levels of queries, which could lead to wasted computational resources. This paper addresses this challenge by proposing a capability-cost coordinated scheduling framework, ECCOS, for multi-LLM serving, which explicitly constrains response quality and workload to optimize LLM inference cost. Specifically, it introduces two-stage scheduling by designing a multi-objective predictor and a constrained optimizer. The predictor estimates both model capabilities and computational costs through training-based and retrieval-based approaches, while the optimizer determines cost-optimal assignments under quality and workload constraints. It also introduces QAServe, a dataset collected for sample-wise response quality and costs by zero-shot prompting different LLMs on knowledge QA and mathematical reasoning. Extensive experiments demonstrate that ECCOS improves success rates by 6.30% while reducing costs by 10.15% compared to existing methods, consuming less than 0.5% of LLM response time. The code is available at: https://github.com/agiresearch/ECCOS.
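
As a rough illustration of the two-stage idea in the abstract, the Python sketch below pairs a retrieval-style predictor with a simple assignment step. It is not the ECCOS implementation. The nearest-neighbour averaging, the per-query quality threshold `min_quality`, the per-model `capacity` limit, and every name and number here are assumptions made for illustration. The greedy loop is a stand-in for the paper's constrained optimizer and does not guarantee cost-optimal assignments.

```python
import math
from dataclasses import dataclass


@dataclass
class Prediction:
    quality: float  # estimated chance that the model answers correctly
    cost: float     # estimated inference cost (arbitrary units)


def knn_predict(query_vec, history, k=2):
    """Stage 1 (assumed retrieval-based variant): average the outcomes
    of the k most similar past queries, separately for each model.

    history: list of (embedding, {model: (correct 0/1, cost)}).
    """
    nearest = sorted(history, key=lambda h: math.dist(query_vec, h[0]))[:k]
    models = nearest[0][1].keys()
    return {
        m: Prediction(
            quality=sum(h[1][m][0] for h in nearest) / len(nearest),
            cost=sum(h[1][m][1] for h in nearest) / len(nearest),
        )
        for m in models
    }


def schedule(predictions, capacity, min_quality):
    """Stage 2 (greedy stand-in): send each query to the cheapest model
    that meets the quality threshold and still has spare capacity.

    predictions: one {model: Prediction} dict per query.
    capacity: {model: maximum number of queries it may receive}.
    Returns one model name per query, or None if every model is full.
    """
    load = {m: 0 for m in capacity}
    assignment = []
    for per_model in predictions:
        open_models = [m for m in per_model if load[m] < capacity[m]]
        if not open_models:
            assignment.append(None)
            continue
        feasible = [m for m in open_models
                    if per_model[m].quality >= min_quality]
        if feasible:
            chosen = min(feasible, key=lambda m: per_model[m].cost)
        else:
            # No open model meets the threshold: fall back to best quality.
            chosen = max(open_models, key=lambda m: per_model[m].quality)
        load[chosen] += 1
        assignment.append(chosen)
    return assignment


# Hypothetical toy data: 2-D "embeddings" and two made-up models.
history = [
    ((0.0, 0.0), {"small": (1, 10.0), "large": (1, 50.0)}),
    ((0.1, 0.0), {"small": (1, 12.0), "large": (1, 55.0)}),
    ((1.0, 1.0), {"small": (0, 20.0), "large": (1, 80.0)}),
    ((0.9, 1.0), {"small": (0, 22.0), "large": (1, 90.0)}),
]
queries = [(0.05, 0.0), (0.95, 1.0), (0.0, 0.1)]

predictions = [knn_predict(q, history) for q in queries]
plan = schedule(predictions, capacity={"small": 1, "large": 2}, min_quality=0.5)
print(plan)
```

In this toy run the first query goes to the cheaper model, the second is routed to the larger model because the smaller one is predicted to fail on similar queries, and the third spills over to the larger model once the smaller one reaches its capacity. A real optimizer would weigh all queries jointly rather than in arrival order.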

arXiv information

Authors	Kai Mei, Wujiang Xu, Shuhang Lin, Yongfeng Zhang
Published	2025-03-07 13:35:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
