High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models


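Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs.
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks, which enables training with minimal supervision.
Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods.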


Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require a large number of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks to enable training with minimal supervision. However, existing methods suffer from information redundancy and dimension explosion in the semantic representation, and from high-frequency waveform distortion in the discrete acoustic representation. Autoregressive frameworks exhibit typical instability and uncontrollability issues, while non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which all modules are built on diffusion models. The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) is used as an intermediate semantic representation to solve the problems of information redundancy and dimension explosion in existing semantic coding methods. The mel-spectrogram is used as the acoustic representation. Both semantic and acoustic representations are predicted by continuous variable regression tasks to solve the problem of high-frequency fine-grained waveform distortion. Experimental results show that our proposed method outperforms the baseline method. We provide audio samples on our website.
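
The abstract names Contrastive Token-Acoustic Pretraining (CTAP) but does not state its training objective. As a rough illustration of what a contrastive objective between paired token and acoustic embeddings can look like, the following sketch implements a generic symmetric InfoNCE loss of the kind used in CLIP-style pretraining. The function names, the temperature value, and the choice of this particular loss are assumptions made for illustration; they are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(token_emb, acoustic_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (assumed here, not the paper's stated loss).

    token_emb[i] and acoustic_emb[i] are treated as a matching pair;
    every other combination in the batch acts as a negative.
    """
    assert token_emb and len(token_emb) == len(acoustic_emb)
    t = [l2_normalize(v) for v in token_emb]
    a = [l2_normalize(v) for v in acoustic_emb]
    n = len(t)

    # Cosine similarity matrix, scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(t[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Token-to-acoustic direction: each row should peak on its diagonal entry.
    loss_t2a = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Acoustic-to-token direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_a2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (loss_t2a + loss_a2t)
```

In this sketch, a batch whose token and acoustic embeddings are correctly paired yields a loss near zero, while a batch with the pairs swapped yields a large loss, which is the signal that would pull matching representations together during pretraining.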


Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang
Published: 2023-09-27 09:27:03+00:00

Categories: cs.AI, cs.CL, cs.SD, eess.AS