FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

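Abstract

Finetuning large language models (LLMs) is essential for task adaptation, but today's serving stacks separate inference and finetuning onto different GPU clusters, which wastes resources and leaves hardware underutilized. FlexLLM is the first system that lets LLM inference and PEFT-based finetuning coexist on shared GPUs by fusing their computation at the token level. Its static compilation optimizations, dependent parallelization and graph pruning, substantially reduce activation memory and yield end-to-end GPU memory savings of up to 80%. At runtime, a new token-level finetuning mechanism, combined with a hybrid token scheduler, dynamically interleaves inference tokens and training tokens within each co-serving iteration, maximizing utilization while meeting strict latency SLOs. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO requirements at up to 20 req/s and raises finetuning throughput by 1.9-4.8x under heavy inference load and 2.5-6.8x under light load, retaining more than 76% of peak finetuning progress even at peak demand. The FlexLLM source code is available at https://github.com/flexflow/FlexFlow/.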

Abstract (original)

Finetuning large language models (LLMs) is essential for task adaptation, yet serving stacks today isolate inference and finetuning on separate GPU clusters, wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. The static compilation optimizations in FlexLLM (dependent parallelization and graph pruning) significantly shrink activation memory, leading to end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM sustains the inference SLO requirements up to 20 req/s, and improves finetuning throughput by 1.9-4.8x under heavy inference workloads and 2.5-6.8x under light loads, preserving over 76% of peak finetuning progress even at peak demand. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow/.
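To make the idea of interleaving inference and training tokens within one iteration concrete, here is a minimal sketch. It is not FlexLLM's scheduler: the function `plan_iteration`, the `IterationPlan` type, and the fixed per-iteration `token_budget` are all hypothetical, and the sketch simply assumes that inference tokens are admitted first and finetuning tokens fill whatever room is left.

```python
from dataclasses import dataclass


@dataclass
class IterationPlan:
    inference_tokens: int   # tokens from latency-sensitive inference requests
    finetuning_tokens: int  # finetuning tokens packed into the leftover room


def plan_iteration(pending_inference: int, pending_finetuning: int,
                   token_budget: int) -> IterationPlan:
    """Fill one co-serving iteration: inference first, finetuning with the rest.

    token_budget is a hypothetical per-iteration cap, assumed to be chosen
    so that an iteration of this size still meets the latency SLO.
    """
    if min(pending_inference, pending_finetuning, token_budget) < 0:
        raise ValueError("counts and budget must be non-negative")
    inference = min(pending_inference, token_budget)
    finetuning = min(pending_finetuning, token_budget - inference)
    return IterationPlan(inference, finetuning)


if __name__ == "__main__":
    # Light inference load: most of the budget goes to finetuning.
    print(plan_iteration(pending_inference=64, pending_finetuning=4096,
                         token_budget=512))
    # Heavy inference load: finetuning gets only the small remainder.
    print(plan_iteration(pending_inference=480, pending_finetuning=4096,
                         token_budget=512))
```

Under these assumptions, finetuning progress shrinks as inference load grows but only stops when inference alone fills the budget, which is the qualitative behavior the abstract describes for heavy versus light loads.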

arXiv information

Authors	Gabriele Oliaro, Xupeng Miao, Xinhao Cheng, Vineeth Kada, Ruohan Gao, Yingyi Huang, Remi Delacourt, April Yang, Yingcheng Wang, Mengdi Wu, Colin Unger, Zhihao Jia
Published	2025-05-02 15:56:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
