Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Summary

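Interest has recently grown in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and decoupling TTS into two sequence-to-sequence tasks.
To address the high dimensionality and waveform distortion of discrete representations, we propose Diff-LM-Speech, which uses a diffusion model to map semantic embeddings to mel-spectrograms and introduces a prompt encoder, built on a variational autoencoder and a prosody bottleneck, to improve prompt representation.
Autoregressive language models often drop or repeat words, while non-autoregressive frameworks suffer from expression averaging caused by their duration prediction models.
To address these issues, we propose Tetra-Diff-Speech, which uses a duration diffusion model to produce diverse prosodic expression.
The information content of semantic coding is expected to lie between that of text and acoustic coding, yet existing models extract semantic codes that carry a great deal of redundant information and suffer from dimensionality explosion.
To verify that semantic coding is not necessary, we propose Tri-Diff-Speech.
Experimental results show that the proposed methods outperform the baseline methods.
Audio samples are available on a website.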

Summary (original)

Recently, there has been a growing interest in text-to-speech (TTS) methods that can be trained with minimal supervision by combining two types of discrete speech representations and using two sequence-to-sequence tasks to decouple TTS. To address the challenges associated with high dimensionality and waveform distortion in discrete representations, we propose Diff-LM-Speech, which models semantic embeddings into mel-spectrogram based on diffusion models and introduces a prompt encoder structure based on variational autoencoders and prosody bottlenecks to improve prompt representation capabilities. Autoregressive language models often suffer from missing and repeated words, while non-autoregressive frameworks face expression averaging problems due to duration prediction models. To address these issues, we propose Tetra-Diff-Speech, which designs a duration diffusion model to achieve diverse prosodic expressions. While we expect the information content of semantic coding to be between that of text and acoustic coding, existing models extract semantic coding with a lot of redundant information and dimensionality explosion. To verify that semantic coding is not necessary, we propose Tri-Diff-Speech. Experimental results show that our proposed methods outperform baseline methods. We provide a website with audio samples.

arXiv information

Authors	Chunyu Qiang, Hao Li, Hao Ni, He Qu, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
Published	2023-07-28 11:20:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
