Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling

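Summary

Conversational speech synthesis (CSS) aims to render an utterance accurately, with prosody and emotional inflection appropriate to the conversational setting.
Although the importance of the CSS task is recognized, prior work has not thoroughly investigated emotional expressiveness, owing to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling.
This paper proposes a new emotional CSS model, called ECSS, with two main components:
1) To strengthen emotion understanding, a heterogeneous graph-based emotional context modeling mechanism takes the multi-source dialogue history as input, models the dialogue context, and learns emotion cues from that context.
2) To achieve emotion rendering, a contrastive learning-based emotion renderer module infers the accurate emotion style for the target utterance.
To address data scarcity, the authors carefully create emotion labels in terms of category and intensity, and annotate the existing conversational dataset (DailyTalk) with this additional emotional information.
Both objective and subjective evaluations suggest that the model outperforms the baseline models in understanding and rendering emotions.
These evaluations also underscore the importance of comprehensive emotional annotations.
Code and audio samples are available at https://github.com/walker-hyf/ECSS.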

Abstract (original)

Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of CSS task, the prior studies have not thoroughly investigated the emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn the emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS.
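
The abstract names a contrastive learning-based emotion renderer but does not state the loss it uses. As a rough illustration of the general idea only, the sketch below implements a generic supervised contrastive loss over emotion-labeled embeddings: vectors sharing an emotion label are pulled together and all others are pushed apart. The function names, the temperature value, and the toy two-dimensional embeddings are assumptions made for this example and are not taken from the ECSS code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the ECSS loss).

    For each anchor, every other sample with the same label is a positive;
    all other samples form the softmax denominator. Anchors without any
    positive are skipped.
    """
    n = len(embeddings)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Temperature-scaled similarities to every sample except the anchor.
        logits = {
            j: cosine(embeddings[i], embeddings[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the denominator.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits.values())
        )
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Hypothetical toy data: two "happy" and two "sad" utterance embeddings.
labels = ["happy", "happy", "sad", "sad"]
clustered = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_clustered = supervised_contrastive_loss(clustered, labels)
loss_mixed = supervised_contrastive_loss(mixed, labels)
```

In this toy setup, `loss_clustered` comes out lower than `loss_mixed`, because in `clustered` the embeddings that share an emotion label already point in similar directions. A real system would compute such a loss on learned embeddings inside a deep learning framework; plain Python is used here only to keep the sketch self-contained.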

arXiv information

Authors	Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li
Published	2023-12-19 08:47:50+00:00

Sources and services used

arxiv.jp, Google
