DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency

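Abstract

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a critical workload across many applications. Today, LLM inference clusters receive large volumes of queries with strict service level objectives (SLOs). To achieve the desired performance, these models run on power-hungry GPUs, so inference clusters consume large amounts of energy and, as a result, cause excessive carbon emissions. Fortunately, we find a significant opportunity to exploit the heterogeneity of inference compute characteristics and the fluctuations in inference workloads to substantially improve energy efficiency. However, such a diverse and dynamic environment creates a large search space in which different system configurations (number of instances, degree of model parallelism, GPU frequency, and so on) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize the energy and cost of LLM serving under the service's performance SLOs. At the service level, we show that DynamoLLM saves 53% of energy and 38% of operational carbon emissions and reduces customer cost by 61%, while meeting latency SLOs.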

Abstract (original)

The rapid evolution and widespread adoption of generative large language models (LLMs) have made them a pivotal workload in various applications. Today, LLM inference clusters receive a large number of queries with strict Service Level Objectives (SLOs). To achieve the desired performance, these models execute on power-hungry GPUs causing the inference clusters to consume large amount of energy and, consequently, result in excessive carbon emissions. Fortunately, we find that there is a great opportunity to exploit the heterogeneity in inference compute properties and fluctuations in inference workloads, to significantly improve energy-efficiency. However, such a diverse and dynamic environment creates a large search-space where different system configurations (e.g., number of instances, model parallelism, and GPU frequency) translate into different energy-performance trade-offs. To address these challenges, we propose DynamoLLM, the first energy-management framework for LLM inference environments. DynamoLLM automatically and dynamically reconfigures the inference cluster to optimize for energy and cost of LLM serving under the service’s performance SLOs. We show that at a service-level, DynamoLLM conserves 53% energy and 38% operational carbon emissions, and reduces 61% cost to the customer, while meeting the latency SLOs.

arXiv information

Authors	Jovan Stojkovic,Chaojie Zhang,Íñigo Goiri,Josep Torrellas,Esha Choukse
Published	2024-08-01 17:40:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
