StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

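Summary

This paper presents StyleTTS 2, a text-to-speech (TTS) model that uses style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis.
Unlike its predecessor, StyleTTS 2 models style as a latent random variable through diffusion models, generating the most suitable style for the text without requiring reference speech; this keeps the latent diffusion efficient while retaining the diverse speech synthesis that diffusion models offer.
In addition, large pre-trained SLMs such as WavLM are used as discriminators, together with a novel differentiable duration modeling for end-to-end training, which improves speech naturalness.
StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches human recordings on the multispeaker VCTK dataset, as judged by native English speakers.
Moreover, when trained on the LibriTTS dataset, the model outperforms previous publicly available models for zero-shot speaker adaptation.
This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, demonstrating the potential of style diffusion and adversarial training with large SLMs.
Audio demos and source code are available at https://styletts2.github.io/.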

Abstract (original)

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches it on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

arXiv information

Authors	Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani
Published	2023-11-20 04:23:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
