LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

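Abstract

Recent advances in speech-to-speech dialogue systems leverage LLMs for multimodal interaction, but they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment.
Existing speech-enabled LLMs often degrade conversational quality because they modify the LLM, which compromises its linguistic capabilities.
In contrast, we propose LLMVoX, a lightweight, 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech at low latency while fully preserving the capabilities of the base LLM.
Our approach achieves a significantly lower Word Error Rate than speech-enabled LLMs while operating at comparable latency and UTMOS score.
By decoupling speech synthesis from LLM processing through a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues.
Its plug-and-play design also makes it easy to extend to various tasks with different backbones.
Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task.
Additionally, we integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training.
Our code base and project page are available at https://mbzuai-oryx.github.io/LLMVoX.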
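
The abstract names a multi-queue token streaming system but does not describe its internals. The following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: `fake_llm_tokens` and `fake_tts` are hypothetical stand-ins for a streaming LLM and a TTS model, sentences are dispatched round-robin to two worker queues, and the outputs are collected back in sentence order. It is not the LLMVoX implementation.

```python
import queue
import threading


def fake_llm_tokens(text):
    """Hypothetical stand-in for a streaming LLM: yields one word at a time."""
    for word in text.split():
        yield word


def fake_tts(sentence):
    """Hypothetical stand-in for a TTS model: returns a placeholder string."""
    return f"<audio:{sentence}>"


def tts_worker(in_q, out_q):
    """Synthesize (index, sentence) items until a None sentinel arrives."""
    while True:
        item = in_q.get()
        if item is None:
            break
        idx, sentence = item
        out_q.put((idx, fake_tts(sentence)))


def stream_speech(text, num_workers=2):
    """Dispatch sentences round-robin to TTS workers; return audio in order."""
    in_queues = [queue.Queue() for _ in range(num_workers)]
    out_q = queue.Queue()
    workers = [
        threading.Thread(target=tts_worker, args=(q, out_q)) for q in in_queues
    ]
    for w in workers:
        w.start()

    # Producer side: the LLM keeps emitting tokens while the workers
    # synthesize earlier sentences concurrently.
    idx, buffer = 0, []
    for token in fake_llm_tokens(text):
        buffer.append(token)
        if token.endswith((".", "!", "?")):
            in_queues[idx % num_workers].put((idx, " ".join(buffer)))
            idx, buffer = idx + 1, []
    if buffer:  # flush a trailing fragment without end punctuation
        in_queues[idx % num_workers].put((idx, " ".join(buffer)))
        idx += 1

    # Signal end of stream and wait for the workers to finish.
    for q in in_queues:
        q.put(None)
    for w in workers:
        w.join()

    # Workers may finish out of order, so sort by sentence index.
    chunks = sorted(out_q.get() for _ in range(idx))
    return [audio for _, audio in chunks]


if __name__ == "__main__":
    for chunk in stream_speech("Hello there. How are you? Fine"):
        print(chunk)
```

For simplicity the sketch waits for all workers before returning; a real streaming system would presumably start playback as soon as the first chunk is ready.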

Abstract (original)

Recent advancements in speech-to-speech dialogue systems leverage LLMs for multimodal interactions, yet they remain hindered by fine-tuning requirements, high computational overhead, and text-speech misalignment. Existing speech-enabled LLMs often degrade conversational quality by modifying the LLM, thereby compromising its linguistic capabilities. In contrast, we propose LLMVoX, a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming TTS system that generates high-quality speech with low latency, while fully preserving the capabilities of the base LLM. Our approach achieves a significantly lower Word Error Rate compared to speech-enabled LLMs, while operating at comparable latency and UTMOS score. By decoupling speech synthesis from LLM processing via a multi-queue token streaming system, LLMVoX supports seamless, infinite-length dialogues. Its plug-and-play design also facilitates extension to various tasks with different backbones. Furthermore, LLMVoX generalizes to new languages with only dataset adaptation, attaining a low Character Error Rate on an Arabic speech task. Additionally, we have integrated LLMVoX with a Vision-Language Model to create an omni-model with speech, text, and vision capabilities, without requiring additional multimodal training. Our code base and project page is available at https://mbzuai-oryx.github.io/LLMVoX .

arXiv information

Authors	Sambal Shikhar, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jean Lahoud, Fahad Khan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal
Published	2025-03-06 18:59:38+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
