Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling

Summary

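Zero-shot streaming text-to-speech (TTS) is an important research topic in human-computer interaction.
Existing methods mainly rely on a lookahead mechanism that uses future text to achieve natural streaming speech synthesis, which introduces high processing latency.
To address this issue, we propose SMLLE, a streaming framework that generates high-quality speech frame by frame.
SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information.
The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms.
To further stabilize the generation process, we design a Delete <Bos> Mechanism that lets the AR model access future text while introducing as little delay as possible.
Experimental results suggest that SMLLE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
Samples are available at shy-98.github.io/SMLLE_demo_page/.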
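
The two-stage data flow described in the summary can be pictured with a toy sketch. This is not the SMLLE implementation: `toy_transducer`, `toy_ar_decoder`, `stream_tts`, the vowel-based duration rule, and the scalar "frames" are all hypothetical stand-ins chosen only to show how a stage that emits (token, duration) pairs can feed a frame-by-frame autoregressive stage without waiting for the full sentence. The Delete <Bos> Mechanism is not modeled here, since the summary does not describe its details.

```python
from typing import Iterable, Iterator, Tuple


def toy_transducer(chars: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Hypothetical stand-in for the Transducer stage.

    Emits one (semantic_token, duration_in_frames) pair per non-space
    character as soon as that character arrives.
    """
    for ch in chars:
        if ch.isspace():
            continue
        # Toy duration rule (an assumption): vowels last 2 frames, others 1.
        duration = 2 if ch.lower() in "aeiou" else 1
        yield "tok_" + ch.lower(), duration


def toy_ar_decoder(tokens: Iterable[Tuple[str, int]]) -> Iterator[Tuple[str, float]]:
    """Hypothetical stand-in for the AR stage.

    Produces one scalar "frame" per duration step; each frame depends on
    the previous frame, which is what makes the loop autoregressive.
    """
    prev = 0.0
    for token, duration in tokens:
        target = float(ord(token[-1]) % 10)  # fake per-token target value
        for _ in range(duration):
            prev = 0.5 * prev + 0.5 * target
            yield token, prev


def stream_tts(chars: Iterable[str]) -> Iterator[Tuple[str, float]]:
    """Chain the two toy stages into one lazy pipeline."""
    return toy_ar_decoder(toy_transducer(chars))


# Frames come out one at a time while text is still being consumed.
frames = list(stream_tts("hi a"))
```

Because both stages are generators, asking for the first frame pulls only the first character from the input, which mirrors the low-latency goal stated above in a very simplified way.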

Summary (original)

Zero-shot streaming text-to-speech is an important research topic in human-computer interaction. Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis, which introduces high processing latency. To address this issue, we propose SMLLE, a streaming framework for generating high-quality speech frame-by-frame. SMLLE employs a Transducer to convert text into semantic tokens in real time while simultaneously obtaining duration alignment information. The combined outputs are then fed into a fully autoregressive (AR) streaming model to reconstruct mel-spectrograms. To further stabilize the generation process, we design a Delete < Bos > Mechanism that allows the AR model to access future text introducing as minimal delay as possible. Experimental results suggest that the SMLLE outperforms current streaming TTS methods and achieves comparable performance over sentence-level TTS systems. Samples are available on shy-98.github.io/SMLLE_demo_page/.

arXiv information

Authors	Haiyang Sun, Shujie Hu, Shujie Liu, Lingwei Meng, Hui Wang, Bing Han, Yifan Yang, Yanqing Liu, Sheng Zhao, Yan Lu, Yanmin Qian
Published	2025-06-02 10:03:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
