Reinforcement Learning for Fine-tuning Text-to-speech Diffusion Models

Abstract

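Recent advances in generative models have drawn considerable interest within the machine learning community.
Diffusion models in particular have shown remarkable capabilities in image and speech synthesis.
Studies by Lee et al. [19], Black et al. [4], Wang et al. [36], and Fan et al. [8] show that reinforcement learning with human feedback (RLHF) can enhance diffusion models for image synthesis.
However, because of architectural differences between these models and those used in speech synthesis, it remains unclear whether RLHF brings similar benefits to speech synthesis models.
This paper examines the practical application of RLHF to diffusion-based text-to-speech synthesis, using the mean opinion score (MOS) predicted by the UTokyo-SaruLab MOS prediction system [29] as a proxy loss.
We introduce diffusion model loss-guided RL policy optimization (DLPO) and compare it with other RLHF approaches, using the NISQA speech quality and naturalness assessment model [21] and human preference experiments for further evaluation.
Our results show that RLHF can enhance diffusion-based text-to-speech synthesis models and, moreover, that DLPO better improves diffusion models in generating natural, high-quality speech.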

Abstract (original)

Recent advancements in generative models have sparked significant interest within the machine learning community. Particularly, diffusion models have demonstrated remarkable capabilities in synthesizing images and speech. Studies such as those by Lee et al. [19], Black et al. [4], Wang et al. [36], and Fan et al. [8] illustrate that Reinforcement Learning with Human Feedback (RLHF) can enhance diffusion models for image synthesis. However, due to architectural differences between these models and those employed in speech synthesis, it remains uncertain whether RLHF could similarly benefit speech synthesis models. In this paper, we explore the practical application of RLHF to diffusion-based text-to-speech synthesis, leveraging the mean opinion score (MOS) as predicted by UTokyo-SaruLab MOS prediction system [29] as a proxy loss. We introduce diffusion model loss-guided RL policy optimization (DLPO) and compare it against other RLHF approaches, employing the NISQA speech quality and naturalness assessment model [21] and human preference experiments for further evaluation. Our results show that RLHF can enhance diffusion-based text-to-speech synthesis models, and, moreover, DLPO can better improve diffusion models in generating natural and high-quality speech audio.

arXiv information

Authors	Jingyi Chen, Ju-Seung Byun, Micha Elsner, Andrew Perrault
Published	2024-05-23 14:39:35+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
