SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Summary

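This paper presents SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification.
The key idea behind SpecInfer is to use small speculative models to predict the LLM's outputs.
The predictions are organized as a token tree, in which each node represents a candidate token sequence.
The correctness of every candidate token sequence represented by the token tree is verified against the LLM in parallel, using a novel tree-based parallel decoding mechanism.
SpecInfer uses the LLM as a token tree verifier rather than as an incremental decoder, which significantly reduces the end-to-end latency and computational requirements of serving generative LLMs while provably preserving model quality.
In the authors' evaluation, SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance.
SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/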
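
To make the token tree idea concrete, here is a minimal Python sketch. It is not SpecInfer's implementation: the names `TokenNode`, `build_token_tree`, `verify_greedy`, and `toy_llm` are hypothetical, the "LLM" is a toy function, and verification assumes greedy decoding. The sketch also queries the toy model one node at a time for readability, whereas the system described above scores all tree nodes against the LLM in parallel.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TokenNode:
    """One node of the token tree; the path from the root spells a candidate sequence."""
    token: int
    children: Dict[int, "TokenNode"] = field(default_factory=dict)


def build_token_tree(candidates: List[List[int]]) -> TokenNode:
    """Merge speculated sequences into a tree so shared prefixes are stored once."""
    root = TokenNode(token=-1)  # sentinel root standing for the already-decoded prefix
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.children.setdefault(tok, TokenNode(tok))
    return root


def verify_greedy(
    root: TokenNode,
    prefix: List[int],
    llm_next_token: Callable[[List[int]], int],
) -> List[int]:
    """Walk the tree, keeping speculated tokens only while they match the LLM.

    Every emitted token is the LLM's own greedy choice, so the result equals
    what plain incremental greedy decoding would have produced.
    """
    accepted: List[int] = []
    node = root
    while True:
        target = llm_next_token(list(prefix) + accepted)
        accepted.append(target)  # the LLM's token is always correct by definition
        child = node.children.get(target)
        if child is None:  # no speculated branch matches: stop this round
            return accepted
        node = child


def toy_llm(context: List[int]) -> int:
    """Hypothetical stand-in for the LLM: next token is (last token + 1) mod 10."""
    return (context[-1] + 1) % 10


if __name__ == "__main__":
    # Three speculated continuations of the prefix [3]; two share the prefix 4, 5.
    tree = build_token_tree([[4, 5, 9], [4, 5, 6], [4, 7]])
    print(verify_greedy(tree, [3], toy_llm))  # prints [4, 5, 6, 7]
```

In this toy run a single verification round yields four tokens: three speculated tokens that match the model, plus the model's own next token at the leaf. When no speculated branch matches, the round still yields one token, which is the same progress as one step of incremental decoding.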

Abstract (original)

This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/

arXiv information

Authors	Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia
Published	2024-01-23 05:02:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
