MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving

Summary

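This paper presents MoE-Infinity, a cost-efficient mixture-of-experts (MoE) serving system that realizes activation-aware expert offloading.
MoE-Infinity features sequence-level expert activation tracing, a new approach suited to identifying sparse activations and capturing the temporal locality of MoE inference.
By analyzing these traces, MoE-Infinity performs activation-aware expert prefetching and caching, which substantially reduces the latency overhead that usually accompanies expert offloading and improves cost performance.
Extensive experiments on a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4 to 20 times and deployment costs by more than 8 times across various MoEs.
The MoE-Infinity source code is publicly available: https://github.com/TorchMoE/MoE-Infinity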
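
To make the idea of activation-aware caching more concrete, here is a minimal Python sketch. It is not MoE-Infinity's implementation or API: the class `ActivationAwareExpertCache`, its methods, and the policy of evicting the expert with the fewest activations in the current sequence are all assumptions chosen for illustration. The abstract only states that sequence-level activation traces guide prefetching and caching; the real system may predict and rank experts quite differently.

```python
from collections import defaultdict


class ActivationAwareExpertCache:
    """Toy cache of experts held in fast memory, guided by a per-sequence trace.

    Hypothetical illustration only: names and policy are assumptions,
    not the MoE-Infinity API.
    """

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity        # max number of experts kept resident
        self.resident = set()           # (layer, expert) pairs currently cached
        self.trace = defaultdict(int)   # activation counts for the current sequence
        self.hits = 0
        self.misses = 0

    def start_sequence(self):
        # The trace is sequence-level, so it is reset for each new sequence.
        self.trace.clear()

    def access(self, layer, expert):
        """Record one activation and make sure the expert is resident."""
        key = (layer, expert)
        self.trace[key] += 1
        if key in self.resident:
            self.hits += 1
        else:
            self.misses += 1
            self._load(key)

    def prefetch(self, layer, k):
        """Load the k experts of `layer` that were activated most often so far."""
        seen = [key for key in self.trace if key[0] == layer]
        seen.sort(key=lambda key: self.trace[key], reverse=True)
        for key in seen[:k]:
            if key not in self.resident:
                self._load(key)

    def _load(self, key):
        if len(self.resident) >= self.capacity:
            # Evict the resident expert with the fewest activations in the trace
            # (ties broken by key so the choice is deterministic).
            victim = min(self.resident, key=lambda r: (self.trace.get(r, 0), r))
            self.resident.remove(victim)
        self.resident.add(key)


# Usage: expert 0 of layer 0 is activated repeatedly, experts 1 and 2 once each.
cache = ActivationAwareExpertCache(capacity=2)
cache.start_sequence()
for expert in [0, 0, 1, 0, 2, 0]:
    cache.access(layer=0, expert=expert)
```

In this toy run the frequently activated expert stays resident while the rarely used expert 1 is evicted to make room for expert 2, which is the sparse-activation and temporal-locality behavior the summary refers to.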

Summary (original)

This paper presents MoE-Infinity, a cost-efficient mixture-of-expert (MoE) serving system that realizes activation-aware expert offloading. MoE-Infinity features sequence-level expert activation tracing, a new approach adept at identifying sparse activations and capturing the temporal locality of MoE inference. By analyzing these traces, MoE-Infinity performs novel activation-aware expert prefetching and caching, substantially reducing the latency overheads usually associated with offloading experts for improved cost performance. Extensive experiments in a cluster show that MoE-Infinity outperforms numerous existing systems and approaches, reducing latency by 4 – 20X and decreasing deployment costs by over 8X for various MoEs. MoE-Infinity’s source code is publicly available at https://github.com/TorchMoE/MoE-Infinity

arXiv information

Authors	Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina
Published	2024-01-25 18:07:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
