Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

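Summary

In expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly improve the naturalness and controllability of synthesized speech.
Human prosody annotation contributes substantially to performance, but it is labor-intensive and time-consuming, and it often yields inconsistent results.
Even with extensive supervised data available, the current benchmark model still falls short in performance.
To address this, the paper proposes a two-stage automatic annotation pipeline.
In the first stage, the authors propose contrastive text-speech pretraining on Speech-Silence and Word-Punctuation (SSWP) pairs.
The pretraining procedure aims to enhance the prosodic space extracted from the joint text-speech space.
In the second stage, they build a multi-modal prosody annotator consisting of pretrained encoders, a simple yet effective text-speech feature fusion scheme, and a sequence classifier.
Extensive experiments show that the proposed method excels at automatically generating prosody annotations and achieves state-of-the-art (SOTA) performance.
The model also remains robust when tested with varying amounts of data.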
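
To make the first stage more concrete, here is a minimal sketch of a contrastive objective over paired text and speech embeddings. The abstract does not specify the loss, so the symmetric cross-entropy form (in the style of CLIP), the `temperature` value, and all function names below are assumptions for illustration, not the authors' implementation. Each `text_emb[i]` stands in for a word-punctuation embedding and each `speech_emb[i]` for the matching speech-silence embedding.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(text_emb, speech_emb, temperature=0.07):
    """Assumed CLIP-style loss: text_emb[i] and speech_emb[i] are a matched pair."""
    t = [l2_normalize(v) for v in text_emb]
    s = [l2_normalize(v) for v in speech_emb]
    n = len(t)
    # Cosine-similarity logits: rows index text items, columns index speech items.
    logits = [
        [sum(a * b for a, b in zip(t[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text-to-speech direction: the correct column for row i is i.
    t2s = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Speech-to-text direction: same computation on the transposed matrix.
    transposed = [[logits[i][j] for i in range(n)] for j in range(n)]
    s2t = -sum(log_softmax(transposed[j])[j] for j in range(n)) / n
    return 0.5 * (t2s + s2t)


# Toy data (hypothetical): two pairs whose embeddings roughly agree.
text = [[1.0, 0.0], [0.0, 1.0]]
speech_matched = [[0.9, 0.1], [0.1, 0.9]]
speech_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_matched = symmetric_contrastive_loss(text, speech_matched)
loss_swapped = symmetric_contrastive_loss(text, speech_swapped)
```

With this toy data the loss is lower when the pairs line up than when they are swapped, which is the behavior a contrastive pretraining objective is meant to reward.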

Abstract (original)

In the realm of expressive Text-to-Speech (TTS), explicit prosodic boundaries significantly advance the naturalness and controllability of synthesized speech. While human prosody annotation contributes a lot to the performance, it is a labor-intensive and time-consuming process, often resulting in inconsistent outcomes. Despite the availability of extensive supervised data, the current benchmark model still faces performance setbacks. To address this issue, a two-stage automatic annotation pipeline is novelly proposed in this paper. Specifically, in the first stage, we propose contrastive text-speech pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs. The pretraining procedure hammers at enhancing the prosodic space extracted from joint text-speech space. In the second stage, we build a multi-modal prosody annotator, which consists of pretrained encoders, a straightforward yet effective text-speech feature fusion scheme, and a sequence classifier. Extensive experiments conclusively demonstrate that our proposed method excels at automatically generating prosody annotation and achieves state-of-the-art (SOTA) performance. Furthermore, our novel model has exhibited remarkable resilience when tested with varying amounts of data.
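
The second stage described in the abstract (pretrained encoders, a feature fusion scheme, and a sequence classifier) can be pictured with the sketch below. The abstract does not say how features are fused or what the classifier looks like, so per-token concatenation, a linear scoring layer, the label names, and the hand-set weights are all hypothetical choices made for illustration only.

```python
def fuse(text_feats, speech_feats):
    """Concatenate per-token text and speech feature vectors (assumed fusion)."""
    if len(text_feats) != len(speech_feats):
        raise ValueError("text and speech features must be aligned per token")
    return [t + s for t, s in zip(text_feats, speech_feats)]


def classify_sequence(fused, weights, bias, labels):
    """Score each token with a linear layer and pick the highest-scoring label."""
    predictions = []
    for x in fused:
        scores = [
            sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)
        ]
        predictions.append(labels[scores.index(max(scores))])
    return predictions


# Hypothetical stand-ins for encoder outputs, one vector per token:
# text feature = 1.0 if the word is followed by punctuation, else 0.0;
# speech feature = duration in seconds of the silence after the word.
text_feats = [[0.0], [1.0], [0.0]]
speech_feats = [[0.02], [0.45], [0.05]]

labels = ["NO_BREAK", "BREAK"]       # hypothetical label set
weights = [[0.0, 0.0], [1.0, 4.0]]   # hand-set toy weights, one row per label
bias = [1.0, 0.0]

predicted = classify_sequence(fuse(text_feats, speech_feats), weights, bias, labels)
# predicted == ["NO_BREAK", "BREAK", "NO_BREAK"]
```

In a trained system the encoders, fusion, and classifier weights would be learned from annotated data; the toy values here only show how text and speech evidence can be combined into one per-token decision.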

arXiv information

Authors	Jinzuomu Zhong, Yang Li, Hui Huang, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu
Published	2023-09-11 12:50:28+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
