NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Summary

Title: NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Summary:

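– Scaling text-to-speech (TTS) to large-scale, multi-speaker, in-the-wild datasets is important for capturing the diversity of human speech, such as speaker identities, prosodies, and styles (e.g., singing).
– Current large-scale TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one. This approach suffers from unstable prosody, word skipping and repetition, and poor voice quality.
– This paper develops NaturalSpeech 2, a TTS system that uses a neural audio codec with residual vector quantizers to obtain quantized latent vectors, and a diffusion model to generate these latent vectors conditioned on text input.
– To enhance the zero-shot capability that is important for diverse speech synthesis, the authors design a speech prompting mechanism that facilitates in-context learning in the diffusion model and the duration/pitch predictor.
– NaturalSpeech 2 is scaled to datasets with 44K hours of speech and singing data, and its voice quality is evaluated on unseen speakers. In a zero-shot setting, NaturalSpeech 2 outperforms previous TTS systems by a large margin in prosody/timbre similarity, robustness, and voice quality, and it performs novel zero-shot singing synthesis with only a speech prompt.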
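
The summary mentions residual vector quantizers without saying how they work. The following is a minimal sketch of the general technique only, not the codec used in NaturalSpeech 2: each stage picks the codebook entry nearest to the residual left by the previous stages, and decoding sums the chosen entries. The function names (`make_codebooks`, `rvq_encode`, `rvq_decode`) and the random toy codebooks are assumptions for illustration; a real codec learns its codebooks from data.

```python
import random


def make_codebooks(num_stages, size, dim, seed=0):
    """Build random toy codebooks (assumption: real codecs learn these)."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0 / (stage + 1)) for _ in range(dim)]
         for _ in range(size)]
        for stage in range(num_stages)
    ]


def nearest(codebook, vec):
    """Return the index of the entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the remaining residual.

    Returns the chosen index per stage and the final, unencoded residual.
    """
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected entry from every stage to rebuild the quantized vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


if __name__ == "__main__":
    books = make_codebooks(num_stages=3, size=8, dim=4)
    latent = [0.5, -1.0, 0.25, 2.0]
    codes, leftover = rvq_encode(latent, books)
    print("indices per stage:", codes)
    print("quantized vector:", rvq_decode(codes, books))
    print("residual not captured:", leftover)
```

By construction, the decoded vector plus the final residual equals the input, so the residual measures what the quantizer stack failed to capture.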

Summary (original)

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.

arXiv information

Authors	Kai Shen,Zeqian Ju,Xu Tan,Yanqing Liu,Yichong Leng,Lei He,Tao Qin,Sheng Zhao,Jiang Bian
Published	2023-05-04 17:08:20+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, OpenAI
