ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models

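Summary

This paper describes ServerlessLLM, a locality-enhanced serverless inference system for large language models (LLMs).
ServerlessLLM exploits the substantial capacity and bandwidth of the storage and memory devices available on GPU servers, which reduces costly remote checkpoint downloads and makes checkpoint loading efficient.
It achieves this through three main contributions: (i) fast LLM checkpoint loading, via a new loading-optimized checkpoint format combined with an efficient multi-tier checkpoint loading system;
(ii) locality-driven LLM inference with live migration, which lets ServerlessLLM perform locality-driven server allocation while preserving the low latency of ongoing LLM inference; and
(iii) locality-aware server allocation, which lets ServerlessLLM evaluate the status of each server in the cluster and schedule model startup to take advantage of local checkpoint placement.
In comprehensive experiments that include microbenchmarks and real-world traces, ServerlessLLM outperforms state-of-the-art systems by 10 to 200x in latency when running various LLM inference workloads.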
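
To make the idea of locality-aware server allocation more concrete, here is a minimal Python sketch. It is not the ServerlessLLM scheduler: the `Server` class, the tier names, the bandwidth figures, and the cost model (queue delay plus checkpoint size divided by tier bandwidth) are all assumptions made for illustration. It only shows how preferring a server that already holds a checkpoint locally can lower estimated startup time.

```python
from dataclasses import dataclass, field

# Hypothetical read bandwidths in GB/s for each storage tier.
# These are illustrative values, not measurements from the paper.
TIER_BANDWIDTH_GBPS = {"dram": 20.0, "ssd": 5.0, "remote": 1.0}


@dataclass
class Server:
    name: str
    queue_delay_s: float = 0.0  # assumed wait before loading can begin
    # Maps a model name to the local tier that holds its checkpoint.
    checkpoints: dict = field(default_factory=dict)


def estimate_startup_s(server: Server, model: str, size_gb: float) -> float:
    """Estimate startup time as queue delay plus checkpoint load time."""
    # A model with no local copy must be fetched from remote storage.
    tier = server.checkpoints.get(model, "remote")
    return server.queue_delay_s + size_gb / TIER_BANDWIDTH_GBPS[tier]


def pick_server(servers: list, model: str, size_gb: float) -> Server:
    """Choose the server with the lowest estimated startup time."""
    return min(servers, key=lambda s: estimate_startup_s(s, model, size_gb))


# Example cluster: one idle server without the checkpoint, one lightly
# loaded server with it on SSD, one busy server with it in DRAM.
servers = [
    Server("idle-no-copy"),
    Server("ssd-copy", queue_delay_s=2.0, checkpoints={"llm": "ssd"}),
    Server("busy-dram-copy", queue_delay_s=10.0, checkpoints={"llm": "dram"}),
]
best = pick_server(servers, "llm", size_gb=10.0)
print(best.name)
```

Under these assumed numbers, the server with an SSD copy wins even though another server is completely idle, because the remote download dominates the idle server's startup time. A real scheduler would also have to account for live migration of ongoing inference, which this sketch leaves out.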

Abstract (original)

This paper presents ServerlessLLM, a locality-enhanced serverless inference system for Large Language Models (LLMs). ServerlessLLM exploits the substantial capacity and bandwidth of storage and memory devices available on GPU servers, thereby reducing costly remote checkpoint downloads and achieving efficient checkpoint loading. ServerlessLLM achieves this through three main contributions: (i) fast LLM checkpoint loading via a novel loading-optimized checkpoint format design, coupled with an efficient multi-tier checkpoint loading system; (ii) locality-driven LLM inference with live migration, which allows ServerlessLLM to effectively achieve locality-driven server allocation while preserving the low latency of ongoing LLM inference; and (iii) locality-aware server allocation, enabling ServerlessLLM to evaluate the status of each server in a cluster and effectively schedule model startup time to capitalize on local checkpoint placement. Our comprehensive experiments, which include microbenchmarks and real-world traces, show that ServerlessLLM surpasses state-of-the-art systems by 10 – 200X in latency performance when running various LLM inference workloads.

arXiv information

Authors	Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, Luo Mai
Published	2024-01-25 17:55:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
