Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation

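Summary

Voice conversion (VC) systems have shown a remarkable ability to transfer voice style, but existing methods still produce inaccurate pitch and low-quality speaker adaptation.
To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models.
First, we introduce DiffPitch, which can effectively generate F0 in the target voice style.
The generated F0 is then fed to DiffVoice, which converts the speech to the target voice style.
In addition, we use a source-filter encoder to disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice, improving the voice style transfer capacity.
Finally, by using a masked prior in the diffusion models, our model can improve speaker adaptation quality.
Experimental results verify the superiority of our model in pitch generation and voice style transfer performance; in zero-shot VC scenarios, the model also achieves a CER of 0.83% and an EER of 3.29%.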

Summary (original)

Although voice conversion (VC) systems have shown a remarkable ability to transfer voice style, existing methods still have an inaccurate pitch and low speaker adaptation quality. To address these challenges, we introduce Diff-HierVC, a hierarchical VC system based on two diffusion models. We first introduce DiffPitch, which can effectively generate F0 with the target voice style. Subsequently, the generated F0 is fed to DiffVoice to convert the speech with a target voice style. Furthermore, using the source-filter encoder, we disentangle the speech and use the converted Mel-spectrogram as a data-driven prior in DiffVoice to improve the voice style transfer capacity. Finally, by using the masked prior in diffusion models, our model can improve the speaker adaptation quality. Experimental results verify the superiority of our model in pitch generation and voice style transfer performance, and our model also achieves a CER of 0.83% and EER of 3.29% in zero-shot VC scenarios.
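The CER reported above is a character error rate: the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The following is a minimal sketch of that standard definition only. The function names are hypothetical, and the abstract does not say which speech recognizer or text normalization the authors used, so this is not their evaluation pipeline.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a character from ref
                curr[j - 1] + 1,     # insert a character into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

For example, `character_error_rate("voice", "vice")` is 0.2, since one deleted character is divided by five reference characters. A corpus-level CER is usually computed by summing edit distances and reference lengths over all utterances before dividing.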

arXiv information

Authors	Ha-Yeong Choi, Sang-Hoon Lee, Seong-Whan Lee
Published	2023-11-08 14:02:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
