Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer

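Abstract

Humans can easily vary prosodic attributes, such as stress placement and emotional intensity, to convey a specific emotion while keeping the linguistic content unchanged.
Motivated by this ability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and to address data scarcity in speech emotion recognition (SER) tasks.
EmoAug consists of a semantic encoder and a paralinguistic encoder, which represent verbal and non-verbal information respectively.
A decoder then reconstructs the speech signal by conditioning on these two information flows in an unsupervised fashion.
Once training is complete, EmoAug enriches the expression of emotional speech with different prosodic attributes, such as stress, rhythm, and intensity, by feeding different styles into the paralinguistic encoder.
EmoAug can also generate a similar number of samples for each class, which addresses the data imbalance issue.
Experimental results on the IEMOCAP dataset show that EmoAug successfully transfers different speaking styles while retaining speaker identity and semantic content.
Furthermore, we train an SER model on data augmented by EmoAug and show that the augmented model not only surpasses state-of-the-art supervised and self-supervised methods but also overcomes the overfitting caused by data imbalance.
Some audio samples are available on the demo website.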
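
The class-balancing idea can be made concrete with a small sketch. Assuming a plain list of emotion labels (the labels and counts below are hypothetical and not taken from IEMOCAP), the helper `augmentation_plan`, a name introduced here for illustration only, computes how many augmented utterances each class would need in order to match the largest class. It does not reproduce EmoAug itself, only the bookkeeping around it.

```python
from collections import Counter


def augmentation_plan(labels):
    """Return, per class, how many synthetic samples are needed
    to bring that class up to the size of the largest class."""
    counts = Counter(labels)
    if not counts:
        # No data, so nothing to balance.
        return {}
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}


# Hypothetical, made-up label distribution (not IEMOCAP statistics).
labels = ["neutral"] * 5 + ["happy"] * 2 + ["angry"] * 3
plan = augmentation_plan(labels)
print(plan)  # {'neutral': 0, 'happy': 3, 'angry': 2}
```

In this sketch, each non-zero entry would correspond to the number of style-transferred utterances to generate for that class; how the styles are chosen is left to the model and is not specified here.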

Abstract (original)

Humans can effortlessly modify various prosodic attributes, such as the placement of stress and the intensity of sentiment, to convey a specific emotion while maintaining consistent linguistic content. Motivated by this capability, we propose EmoAug, a novel style transfer model designed to enhance emotional expression and tackle the data scarcity issue in speech emotion recognition tasks. EmoAug consists of a semantic encoder and a paralinguistic encoder that represent verbal and non-verbal information respectively. Additionally, a decoder reconstructs speech signals by conditioning on the aforementioned two information flows in an unsupervised fashion. Once training is completed, EmoAug enriches expressions of emotional speech with different prosodic attributes, such as stress, rhythm and intensity, by feeding different styles into the paralinguistic encoder. EmoAug enables us to generate similar numbers of samples for each class to tackle the data imbalance issue as well. Experimental results on the IEMOCAP dataset demonstrate that EmoAug can successfully transfer different speaking styles while retaining the speaker identity and semantic content. Furthermore, we train a SER model with data augmented by EmoAug and show that the augmented model not only surpasses the state-of-the-art supervised and self-supervised methods but also overcomes overfitting problems caused by data imbalance. Some audio samples can be found on our demo website.

arXiv information

Authors	Leyuan Qu, Wei Wang, Cornelius Weber, Pengcheng Yue, Taihao Li, Stefan Wermter
Published	2023-12-28 11:09:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
