NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

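Summary

Recent large-scale text-to-speech (TTS) models have made significant progress, but they still fall short in speech quality, similarity, and prosody.
Because speech intricately combines several attributes (such as content, prosody, timbre, and acoustic details) that make generation difficult, a natural idea is to factorize speech into separate subspaces, one per attribute, and generate each individually.
Motivated by this, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models that generates natural speech in a zero-shot way.
Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) that disentangles the speech waveform into subspaces of content, prosody, timbre, and acoustic details;
2) we propose a factorized diffusion model that generates the attributes in each subspace following its corresponding prompt.
With this factorized design, NaturalSpeech 3 can model intricate speech effectively and efficiently, using disentangled subspaces in a divide-and-conquer manner.
Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems in quality, similarity, prosody, and intelligibility, and achieves quality on par with human recordings.
Furthermore, scaling to 1B parameters and 200K hours of training data yields better performance.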
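
The idea of quantizing each attribute subspace with its own codebook can be illustrated with a toy sketch. This is not the paper's codec: the real system learns its encoders and codebooks, whereas here a frame vector is simply sliced at fixed positions and each slice is matched to the nearest entry of a random codebook. The subspace dimensions, the codebook size, and all function names (`make_codebooks`, `factorized_quantize`, `dequantize`) are assumptions made for illustration only.

```python
import random

# Attribute names come from the abstract; the dimensions and codebook size
# are illustrative assumptions, not values from the paper.
SUBSPACES = {"content": 4, "prosody": 2, "timbre": 3, "acoustic_details": 3}
CODEBOOK_SIZE = 8


def make_codebooks(seed=0):
    """Create one random codebook per attribute subspace (hypothetical helper)."""
    rng = random.Random(seed)
    return {
        name: [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
               for _ in range(CODEBOOK_SIZE)]
        for name, dim in SUBSPACES.items()
    }


def nearest_index(vector, codebook):
    """Return the index of the codeword closest to `vector` (squared L2)."""
    def sq_dist(codeword):
        return sum((v - c) ** 2 for v, c in zip(vector, codeword))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def factorized_quantize(frame, codebooks):
    """Slice a frame vector per attribute and quantize each slice with its
    own codebook. Returns a dict mapping attribute name to codeword index."""
    indices, start = {}, 0
    for name, dim in SUBSPACES.items():
        indices[name] = nearest_index(frame[start:start + dim], codebooks[name])
        start += dim
    return indices


def dequantize(indices, codebooks):
    """Concatenate the selected codewords back into a single frame vector."""
    frame = []
    for name in SUBSPACES:
        frame.extend(codebooks[name][indices[name]])
    return frame
```

Because every attribute has its own index, one attribute (say, `timbre`) can be swapped or generated separately without touching the others, which is the property the divide-and-conquer description relies on.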

Summary (original)

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.

arXiv information

Authors	Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao
Published	2024-03-27 16:14:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
