PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Abstract

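Large language models (LLMs) are widely used in real-time voice chat applications, typically paired with text-to-speech (TTS) systems that generate the audio responses.
However, their large size often introduces noticeable latency between the end of user input and the start of audio output, resulting in a suboptimal user experience.
This latency is especially evident when an LLM is deployed as a single-user voice assistant on consumer-grade hardware with limited compute capacity.
We find that this latency is dominated by the time the LLM takes to generate its first sentence, which the TTS system requires as input because it synthesizes audio responses sentence by sentence.
To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time.
PredGen generates candidate responses while the user is still speaking, so the system can begin TTS processing with minimal delay.
Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method reduces latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost, which is spent at input time on compute that would otherwise go unused.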
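
To make the bottleneck concrete, the following back-of-the-envelope sketch models the delay before audio starts as the time to generate the first sentence plus the time to synthesize it. The function name, its parameters, and all numbers are hypothetical illustrations, not measurements or formulas from the paper.

```python
def time_to_first_audio(
    first_sentence_tokens: int,
    tokens_per_second: float,
    tts_seconds: float,
    tokens_already_generated: int = 0,
) -> float:
    """Seconds between the end of user input and the start of audio output.

    Assumed model: the TTS system cannot start until the whole first
    sentence exists, so the delay is (remaining tokens / decode speed)
    plus the TTS synthesis time for that sentence.
    """
    remaining = max(first_sentence_tokens - tokens_already_generated, 0)
    return remaining / tokens_per_second + tts_seconds


# Hypothetical numbers: a 20-token first sentence, 10 tokens/s decoding
# on consumer hardware, 0.5 s to synthesize the first sentence.
baseline = time_to_first_audio(20, 10.0, 0.5)
# Half of the first sentence was already produced while the user spoke.
half_speculated = time_to_first_audio(20, 10.0, 0.5, tokens_already_generated=10)
# The whole first sentence was already produced while the user spoke.
fully_speculated = time_to_first_audio(20, 10.0, 0.5, tokens_already_generated=20)
```

Under these assumed numbers, any token of the first sentence that is produced before the user stops speaking comes directly off the delay, and only the TTS time remains when the full sentence is ready in advance.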

Abstract (original)

Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time, computation that would otherwise go unused.
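
The idea of generating candidates while the user is still speaking can be illustrated with a toy sketch. It is not the authors' implementation: `toy_generate`, `respond`, and `accepted_prefix_len` are hypothetical names, the "model" is a deterministic stub, and verification is simulated by regenerating from the final input. A real speculative decoder would instead score the candidate tokens in a single parallel forward pass.

```python
from typing import Callable, List, Sequence, Tuple


def accepted_prefix_len(candidate: Sequence[str], target: Sequence[str]) -> int:
    """Number of leading tokens on which candidate and target agree."""
    n = 0
    for cand_tok, tgt_tok in zip(candidate, target):
        if cand_tok != tgt_tok:
            break
        n += 1
    return n


def toy_generate(prompt: str) -> List[str]:
    """Hypothetical stand-in for an LLM producing a first sentence."""
    if "weather" in prompt:
        return ["It", "is", "sunny", "today", "."]
    return ["Could", "you", "say", "more", "?"]


def respond(
    partials: Sequence[str],
    final: str,
    generate: Callable[[str], List[str]],
) -> Tuple[List[str], int]:
    """Return the final first sentence and how many tokens were reusable.

    The loop over `partials` models work done while the user is still
    speaking; the last candidate is kept. Once the full input is known,
    only the tokens after the accepted prefix would need fresh decoding.
    """
    candidate: List[str] = []
    for partial in partials:
        candidate = generate(partial)  # input-time speculation
    target = generate(final)  # simulated verification (see lead-in)
    return target, accepted_prefix_len(candidate, target)


reply, reused = respond(
    ["What is", "What is the weather"],
    "What is the weather today",
    toy_generate,
)
```

In this toy run the last partial transcript already determines the reply, so every candidate token is reusable once the user finishes speaking. When the partial input is misleading, the accepted prefix shrinks and the system falls back to ordinary decoding for the rest.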

arXiv information

Authors	Shufan Li, Aditya Grover
Published	2025-06-18 15:29:02+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
