ProMoE: Fast MoE-based LLM Serving using Proactive Caching

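Summary

Promising applications of large language models are often constrained by the limited GPU memory available on edge devices.
Mixture-of-Experts (MoE) models mitigate this problem: only a subset of the model's parameters is active during computation, so the unused parameters can be offloaded to host memory, which lowers overall GPU memory demand.
However, existing cache-based offloading solutions handle cache misses reactively, which significantly degrades system performance.
This paper proposes ProMoE, a novel proactive caching system that uses the model's intermediate results to predict which parameters will be used next.
By fetching experts ahead of time, ProMoE takes the loading time off the critical path and reduces the performance overhead of offloading.
The evaluation shows that, compared with existing offloading solutions, ProMoE achieves average speedups of 2.13x in the prefill stage and 2.84x in the decode stage.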

Abstract (original)

The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model’s parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
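
The difference between reactive and proactive expert caching can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not ProMoE's implementation: the abstract does not describe the predictor or the cache policy, so the LRU cache, the `run` function, the access trace, and the perfect "oracle" predictor below are all hypothetical. The sketch also assumes that a prefetch issued during one step completes before the next step starts.

```python
from collections import OrderedDict


class ExpertCache:
    """Tiny LRU cache standing in for GPU memory that holds expert weights."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # expert id -> placeholder for weights

    def __contains__(self, expert):
        return expert in self.slots

    def touch(self, expert):
        """Mark an already cached expert as most recently used."""
        self.slots.move_to_end(expert)

    def load(self, expert):
        """Bring an expert in from host memory, evicting the LRU entry if full."""
        if expert in self.slots:
            self.slots.move_to_end(expert)
            return
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)
        self.slots[expert] = True


def run(trace, capacity, predict=None):
    """Count expert loads that stall the critical path.

    trace:   expert id needed at each step (hypothetical data).
    predict: optional function mapping a step to the expert predicted
             for the following step (hypothetical predictor).
    """
    cache = ExpertCache(capacity)
    stalls = 0
    for step, expert in enumerate(trace):
        if expert in cache:
            cache.touch(expert)
        else:
            stalls += 1  # reactive miss: computation waits for the load
            cache.load(expert)
        if predict is not None and step + 1 < len(trace):
            # Proactive path: the load overlaps with the current step's
            # computation, so it is assumed to be off the critical path.
            cache.load(predict(step))
    return stalls


trace = [0, 1, 2, 3, 0, 1, 2, 3]  # cyclic access defeats a 2-slot LRU cache
reactive = run(trace, capacity=2)
proactive = run(trace, capacity=2, predict=lambda step: trace[step + 1])
print(reactive, proactive)  # prints "8 1"
```

With a purely reactive cache every access in this trace stalls, while the oracle predictor leaves only the first cold miss on the critical path. A real predictor would be imperfect, so the actual benefit depends on its accuracy and on whether loads finish in time.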

arXiv information

Authors	Xiaoniu Song, Zihang Zhong, Rong Chen
Published	2024-10-29 15:31:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
