Distributed Speculative Inference of Large Language Models

Summary

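Accelerating inference for large language models (LLMs) is an important challenge in artificial intelligence.
This paper introduces distributed speculative inference (DSI), a new distributed inference algorithm that is provably faster than both speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI).
Like other SI algorithms, DSI operates on frozen LLMs, requires no training or architectural changes, and preserves the target distribution.
Earlier work on SI demonstrated empirical speedups over non-SI, but it depends on a drafter LLM that is both fast and accurate.
In practice, off-the-shelf LLMs often lack a matching drafter that is fast and accurate enough.
We show a gap: with a slower or less accurate drafter, SI becomes slower than non-SI.
We close this gap by proving that DSI is faster than both SI and non-SI for any drafter.
By orchestrating multiple instances of the target and the drafters, DSI is not only faster than SI but also supports LLMs that SI cannot accelerate.
Our simulations show speedups for off-the-shelf LLMs in realistic settings: DSI is 1.29 to 1.92 times faster than SI.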

Abstract (original)

Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
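
The gap described in the abstract (SI becoming slower than non-SI when the drafter is slow or inaccurate) can be illustrated with a rough latency model. The sketch below is not the paper's simulation. It assumes the simplified model common in the speculative decoding literature: each draft token is accepted independently with a fixed probability, the drafter proposes a fixed number of tokens per iteration, and the target verifies them in one forward pass. All function and parameter names are hypothetical, and latencies are in arbitrary units.

```python
def expected_tokens_per_iteration(acceptance_rate: float, lookahead: int) -> float:
    """Expected tokens produced by one draft-then-verify iteration.

    Assumes each draft token is accepted independently with probability
    `acceptance_rate` (a simplifying assumption, not taken from the paper).
    """
    if acceptance_rate >= 1.0:
        # Every draft token is accepted, plus one token from the target.
        return lookahead + 1.0
    # Geometric sum: 1 + a + a^2 + ... + a^lookahead
    return (1.0 - acceptance_rate ** (lookahead + 1)) / (1.0 - acceptance_rate)


def si_latency_per_token(target_latency: float, drafter_latency: float,
                         acceptance_rate: float, lookahead: int) -> float:
    """Average latency per generated token under sequential SI."""
    # One iteration = `lookahead` drafter passes followed by one target pass.
    iteration_cost = lookahead * drafter_latency + target_latency
    return iteration_cost / expected_tokens_per_iteration(acceptance_rate, lookahead)


def si_speedup_over_non_si(target_latency: float, drafter_latency: float,
                           acceptance_rate: float, lookahead: int) -> float:
    """Speedup of SI relative to non-SI, which costs one target pass per token."""
    return target_latency / si_latency_per_token(
        target_latency, drafter_latency, acceptance_rate, lookahead)


if __name__ == "__main__":
    # Hypothetical fast, accurate drafter: SI should help.
    print(si_speedup_over_non_si(1.0, 0.05, 0.8, 5))
    # Hypothetical slow, inaccurate drafter: SI should hurt.
    print(si_speedup_over_non_si(1.0, 0.5, 0.3, 5))
```

Under these assumptions a speedup below 1 means SI is slower than plain autoregressive decoding, which is the regime the abstract says DSI is designed to cover. The model says nothing about DSI's own latency, which depends on how the target and drafter instances are orchestrated.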

arXiv information

Authors	Nadav Timor, Jonathan Mamou, Daniel Korat, Moshe Berchansky, Oren Pereg, Moshe Wasserblat, Tomer Galanti, Michal Gordon, David Harel
Published	2024-06-28 15:34:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
