Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

Summary

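Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models generate slowly and offer no control over duration, while non-autoregressive (NAR) models lack temporal modeling and usually require complex designs.
In this paper, we introduce a new pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling.
By combining the explicit temporal modeling of AR with the parallel generation of NAR, PAR generates dynamic-length spans at fixed time steps.
Building on PAR, we propose PALLE, a two-stage TTS system in which initial PAR generation is followed by NAR refinement.
In the first stage, PAR generates speech tokens progressively along the time dimension; each step predicts all positions in parallel but keeps only the left-most span.
In the second stage, low-confidence tokens are iteratively refined in parallel using global context.
Experiments show that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in speech quality, speaker similarity, and intelligibility, while running inference up to ten times faster.
Audio samples are available at https://anonymous-palle.github.io.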

Abstract (original)

Recent zero-shot text-to-speech (TTS) systems face a common dilemma: autoregressive (AR) models suffer from slow generation and lack duration controllability, while non-autoregressive (NAR) models lack temporal modeling and typically require complex designs. In this paper, we introduce a novel pseudo-autoregressive (PAR) codec language modeling approach that unifies AR and NAR modeling. Combining explicit temporal modeling from AR with parallel generation from NAR, PAR generates dynamic-length spans at fixed time steps. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. In the first stage, PAR progressively generates speech tokens along the time dimension, with each step predicting all positions in parallel but only retaining the left-most span. In the second stage, low-confidence tokens are iteratively refined in parallel, leveraging the global contextual information. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data, including F5-TTS, E2-TTS, and MaskGCT, on the LibriSpeech test-clean set in terms of speech quality, speaker similarity, and intelligibility, while achieving up to ten times faster inference speed. Audio samples are available at https://anonymous-palle.github.io.
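
The two stages described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: `toy_model` is a hypothetical stand-in for the neural codec language model, and the names `par_generate`, `nar_refine`, `steps`, `iterations`, and `k` are invented here. The abstract does not say how span lengths are chosen, so the sketch assumes the remaining positions are split evenly over the remaining steps.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

# One prediction per position: (token id, confidence in [0, 1)).
Prediction = Tuple[int, float]
# Hypothetical model interface: takes a sequence in which None marks an
# unknown position and returns a prediction for every position.
Model = Callable[[List[Optional[int]]], List[Prediction]]


def toy_model(seq: List[Optional[int]]) -> List[Prediction]:
    """Stand-in for a codec LM: deterministic pseudo-random predictions."""
    known = sum(t is not None for t in seq)
    out = []
    for i in range(len(seq)):
        rng = random.Random(i * 1000 + known)
        out.append((rng.randrange(1024), rng.random()))
    return out


def par_generate(model: Model, length: int, steps: int) -> Tuple[List[int], List[float]]:
    """Stage 1 (sketch): fill `length` positions left to right in at most `steps` steps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    tokens: List[Optional[int]] = [None] * length
    conf = [0.0] * length
    filled = 0
    for step in range(steps):
        if filled == length:
            break
        preds = model(tokens)  # every position is predicted in parallel
        # Assumed span rule: divide what is left evenly over the steps that remain.
        span = math.ceil((length - filled) / (steps - step))
        for i in range(filled, filled + span):  # keep only the left-most span
            tokens[i], conf[i] = preds[i]
        filled += span
    return [t for t in tokens if t is not None], conf


def nar_refine(model: Model, tokens: List[int], conf: List[float],
               iterations: int, k: int) -> Tuple[List[int], List[float]]:
    """Stage 2 (sketch): re-predict the k least confident tokens per iteration."""
    tokens, conf = list(tokens), list(conf)
    for _ in range(iterations):
        worst = sorted(range(len(tokens)), key=lambda i: conf[i])[:k]
        masked: List[Optional[int]] = list(tokens)
        for i in worst:
            masked[i] = None  # hide low-confidence tokens
        preds = model(masked)  # context on both sides of each gap is visible
        for i in worst:
            tokens[i], conf[i] = preds[i]
    return tokens, conf


draft, draft_conf = par_generate(toy_model, length=50, steps=8)
final, final_conf = nar_refine(toy_model, draft, draft_conf, iterations=2, k=5)
print(len(draft), len(final))
```

In this sketch the output length is fixed before decoding starts, which is one simple way to picture the duration control the abstract contrasts with plain AR decoding. The number of model calls is bounded by `steps + iterations` rather than by the sequence length.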

arXiv information

Authors	Yifan Yang, Shujie Liu, Jinyu Li, Yuxuan Hu, Haibin Wu, Hui Wang, Jianwei Yu, Lingwei Meng, Haiyang Sun, Yanqing Liu, Yan Lu, Kai Yu, Xie Chen
Published	2025-04-14 16:03:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
