Taming the Titans: A Survey of Efficient LLM Inference Serving

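Abstract

Large language models (LLMs) for generative AI have made remarkable progress, evolving into sophisticated, versatile tools that are widely adopted across domains and applications.
However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, makes it difficult to achieve low latency and high throughput in LLM inference serving.
Recent advances, driven by groundbreaking research, have significantly accelerated progress in this field.
This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas.
At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm.
At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions.
For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods.
To ensure a holistic overview, we also highlight several niche yet critical areas.
Finally, we outline potential research directions to further advance the field of LLM inference serving.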

Abstract (original)

Large Language Models (LLMs) for Generative AI have achieved remarkable progress, evolving into sophisticated and versatile tools widely adopted across various domains and applications. However, the substantial memory overhead caused by their vast number of parameters, combined with the high computational demands of the attention mechanism, poses significant challenges in achieving low latency and high throughput for LLM inference services. Recent advancements, driven by groundbreaking research, have significantly accelerated progress in this field. This paper provides a comprehensive survey of these methods, covering fundamental instance-level approaches, in-depth cluster-level strategies, emerging scenario directions, and other miscellaneous but important areas. At the instance level, we review model placement, request scheduling, decoding length prediction, storage management, and the disaggregation paradigm. At the cluster level, we explore GPU cluster deployment, multi-instance load balancing, and cloud service solutions. For emerging scenarios, we organize the discussion around specific tasks, modules, and auxiliary methods. To ensure a holistic overview, we also highlight several niche yet critical areas. Finally, we outline potential research directions to further advance the field of LLM inference serving.

arXiv information

Authors	Ranran Zhen, Juntao Li, Yixin Ji, Zhenlin Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zhefeng Wang, Baoxing Huai, Min Zhang
Published	2025-04-28 12:14:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
