Creating New Voices using Normalizing Flows

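Summary

Creating realistic, natural-sounding synthetic speech for voice identities unseen during training remains a major challenge.
With growing interest in synthesizing the voices of new speakers, we investigate the ability of normalizing flows, in text-to-speech (TTS) and voice conversion (VC) modes, to extrapolate from speakers observed during training and create unseen speaker identities.
We first build approaches for TTS and VC, then comprehensively evaluate our methods and baselines in terms of intelligibility, naturalness, speaker similarity, and the ability to create new voices.
We use both objective and subjective metrics to benchmark our techniques on two evaluation tasks: zero-shot speech synthesis and new-voice speech synthesis.
The former task measures how precisely speech is converted to an unseen voice.
The latter measures the ability to create new voices.
Extensive evaluations show that the proposed approach systematically achieves state-of-the-art performance in zero-shot speech synthesis and creates a variety of new voices not observed in the training set.
We consider this work to be the first attempt to synthesize new voices based on mel-spectrograms and normalizing flows, accompanied by a comprehensive analysis and comparison of the TTS and VC modes.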

Abstract (original)

Creating realistic and natural-sounding synthetic speech remains a big challenge for voice identities unseen during training. As there is growing interest in synthesizing voices of new speakers, here we investigate the ability of normalizing flows in text-to-speech (TTS) and voice conversion (VC) modes to extrapolate from speakers observed during training to create unseen speaker identities. Firstly, we create an approach for TTS and VC, and then we comprehensively evaluate our methods and baselines in terms of intelligibility, naturalness, speaker similarity, and ability to create new voices. We use both objective and subjective metrics to benchmark our techniques on 2 evaluation tasks: zero-shot and new voice speech synthesis. The goal of the former task is to measure the precision of the conversion to an unseen voice. The goal of the latter is to measure the ability to create new voices. Extensive evaluations demonstrate that the proposed approach systematically allows to obtain state-of-the-art performance in zero-shot speech synthesis and creates various new voices, unobserved in the training set. We consider this work to be the first attempt to synthesize new voices based on mel-spectrograms and normalizing flows, along with a comprehensive analysis and comparison of the TTS and VC modes.

arXiv information

Authors	Piotr Bilinski, Thomas Merritt, Abdelhamid Ezzerg, Kamil Pokora, Sebastian Cygert, Kayoko Yanagisawa, Roberto Barra-Chicote, Daniel Korzekwa
Published	2023-12-22 10:00:24+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
