MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition

Summary

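Despite notable progress, speech emotion recognition (SER) remains challenging, particularly in the wild, because speech emotion is complex and ambiguous.
Current studies focus mainly on recognition and generalization ability. This work instead pioneers the study of how reliable SER methods are under semantic data shifts, and explores how fine-grained control over the various attributes inherent in speech signals can enhance speech emotion modeling.
The paper first introduces MSAC-SERNet, a new unified SER framework that handles single-corpus and cross-corpus SER simultaneously.
Specifically, focusing only on the speech emotion attribute, it presents a new CNN-based SER model that extracts discriminative emotional representations, guided by additive margin softmax loss.
Considering the information overlap between speech attributes, it proposes a new learning paradigm based on the correlations among those attributes, termed Multiple Speech Attribute Control (MSAC), which lets the SER model capture fine-grained emotion-related features while mitigating the negative impact of emotion-agnostic representations.
It also makes a first attempt to examine the reliability of the MSAC-SERNet framework using out-of-distribution detection methods.
Experiments in both single-corpus and cross-corpus SER scenarios show that MSAC-SERNet consistently outperforms the baseline in all aspects and achieves better performance than state-of-the-art SER approaches.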

Abstract (original)

Despite notable progress, speech emotion recognition (SER) remains challenging due to the intricate and ambiguous nature of speech emotion, particularly in wild world. While current studies primarily focus on recognition and generalization abilities, our research pioneers an investigation into the reliability of SER methods in the presence of semantic data shifts and explores how to exert fine-grained control over various attributes inherent in speech signals to enhance speech emotion modeling. In this paper, we first introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER. Specifically, concentrating exclusively on the speech emotion attribute, a novel CNN-based SER model is presented to extract discriminative emotional representations, guided by additive margin softmax loss. Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes, termed Multiple Speech Attribute Control (MSAC), which empowers the proposed SER model to simultaneously capture fine-grained emotion-related features while mitigating the negative impact of emotion-agnostic representations. Furthermore, we make a first attempt to examine the reliability of the MSAC-SERNet framework using out-of-distribution detection methods. Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet not only consistently outperforms the baseline in all aspects, but achieves superior performance compared to state-of-the-art SER approaches.
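
The abstract names additive margin softmax loss as the training objective for the emotion classifier but gives no formula here. As a rough illustration only, the sketch below implements the standard form of that loss for a single sample in plain Python: the embedding and class weights are L2-normalized, a margin is subtracted from the target-class cosine, and the scaled result is passed through softmax cross-entropy. The function name `am_softmax_loss` and the defaults `scale=30.0` and `margin=0.35` are assumptions chosen for illustration, not values taken from the MSAC-SERNet paper.

```python
import math


def am_softmax_loss(embedding, class_weights, target, scale=30.0, margin=0.35):
    """Additive margin softmax loss for one sample (illustrative sketch).

    embedding:     list of floats, the feature vector of the sample
    class_weights: list of weight vectors, one per emotion class
    target:        index of the true class
    scale, margin: hypothetical defaults, not taken from the paper
    """
    def l2_normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # guard against zero vectors
        return [x / norm for x in v]

    e = l2_normalize(embedding)
    # Cosine similarity between the embedding and every class weight vector.
    cosines = [sum(a * b for a, b in zip(e, l2_normalize(w))) for w in class_weights]
    # Subtract the margin from the target-class cosine only, then scale all logits.
    logits = [scale * (c - margin if j == target else c) for j, c in enumerate(cosines)]
    # Cross-entropy via a numerically stable log-sum-exp.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target]
```

With `margin=0` this reduces to ordinary softmax cross-entropy on scaled cosine logits; a positive margin makes the target class harder to satisfy, which is what pushes the learned representations toward larger separation between classes.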

arXiv information

Authors	Yu Pan, Yuguang Yang, Yuheng Huang, Jixun Yao, Jingjing Yin, Yanni Hu, Heng Lu, Lei Ma, Jianjun Zhao
Published	2024-03-22 14:49:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
