Exploring Acoustic Similarity in Emotional Speech and Music via Self-Supervised Representations

Summary

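Emotion recognition from speech and from music share similarities because the two signals overlap acoustically, which has prompted interest in transferring knowledge between the two domains.
However, the acoustic cues that speech and music have in common, particularly those encoded by self-supervised learning (SSL) models, remain largely unexplored, since speech and music SSL models have rarely been applied in cross-domain studies.
This work revisits the acoustic similarity between emotional speech and music, beginning with an analysis of the layerwise behavior of SSL models on speech emotion recognition (SER) and music emotion recognition (MER).
It then performs cross-domain adaptation, comparing several approaches within a two-stage fine-tuning process to find effective ways of using music for SER and speech for MER.
Finally, it examines the acoustic similarity between emotional speech and music with the Fréchet audio distance computed for individual emotions, which exposes an emotion bias in both speech and music SSL models.
The findings show that speech and music SSL models do capture shared acoustic features, but that their behavior can vary across emotions because of their training strategies and domain specificity.
In addition, parameter-efficient fine-tuning can improve both SER and MER performance by letting each task draw on knowledge from the other.
The study offers new insight into the acoustic similarity between emotional speech and music, and highlights the potential of cross-domain generalization for improving SER and MER systems.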

Abstract (original)

Emotion recognition from speech and music shares similarities due to their acoustic overlap, which has led to interest in transferring knowledge between these domains. However, the shared acoustic cues between speech and music, particularly those encoded by Self-Supervised Learning (SSL) models, remain largely unexplored, given the fact that SSL models for speech and music have rarely been applied in cross-domain research. In this work, we revisit the acoustic similarity between emotion speech and music, starting with an analysis of the layerwise behavior of SSL models for Speech Emotion Recognition (SER) and Music Emotion Recognition (MER). Furthermore, we perform cross-domain adaptation by comparing several approaches in a two-stage fine-tuning process, examining effective ways to utilize music for SER and speech for MER. Lastly, we explore the acoustic similarities between emotional speech and music using Frechet audio distance for individual emotions, uncovering the issue of emotion bias in both speech and music SSL models. Our findings reveal that while speech and music SSL models do capture shared acoustic features, their behaviors can vary depending on different emotions due to their training strategies and domain-specificities. Additionally, parameter-efficient fine-tuning can enhance SER and MER performance by leveraging knowledge from each other. This study provides new insights into the acoustic similarity between emotional speech and music, and highlights the potential for cross-domain generalization to improve SER and MER systems.
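
The abstract mentions comparing emotional speech and music with the Fréchet audio distance per emotion. As a rough illustration only, the sketch below computes the standard Fréchet distance between Gaussians fitted to two sets of embeddings. It is not the authors' implementation: the function name `frechet_distance`, the embedding dimension, and the random stand-in data are all assumptions, and in practice the embeddings would come from an SSL model layer for clips of a single emotion.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_samples, dim), dim >= 2.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Hypothetical stand-ins for per-emotion embeddings (random data, not real features).
rng = np.random.default_rng(0)
speech_happy = rng.normal(size=(200, 8))
music_happy = rng.normal(loc=0.5, size=(200, 8))
distance = frechet_distance(speech_happy, music_happy)
```

Repeating the computation for each emotion label gives one distance per emotion, so that emotions can be compared by how close their speech and music embeddings are under a given model.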

arXiv information

Authors	Yujia Sun, Zeyu Zhao, Korin Richmond, Yuanchao Li
Published	2025-04-30 13:32:40+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
