Vesper: A Compact and Effective Pretrained Model for Speech Emotion Recognition

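Abstract

This paper introduces a paradigm for adapting general large-scale pretrained models (PTMs) to the speech emotion recognition task.
PTMs shed new light on artificial general intelligence, but because they are built with general tasks in mind, their effectiveness on specific tasks can be improved further.
In addition, their considerable size can make PTMs difficult to use in practical applications.
These limitations have given rise to another research direction: optimizing large-scale PTMs for specific tasks to produce task-specific PTMs that are both compact and effective.
Focusing on the speech emotion recognition task, this paper proposes an improved emotion-specific pretrained encoder called Vesper.
Vesper is pretrained on a speech dataset based on WavLM, taking emotional characteristics into account.
To increase its sensitivity to emotional information, Vesper uses an emotion-guided masking strategy to identify the regions that need masking.
Vesper then uses hierarchical and cross-layer self-supervision to improve its ability to capture the acoustic and semantic representations that are both crucial for emotion recognition.
Experimental results on the IEMOCAP, MELD, and CREMA-D datasets show that a 4-layer Vesper outperforms the 12-layer WavLM Base, and that a 12-layer Vesper surpasses the 24-layer WavLM Large.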

Abstract (original)

This paper presents a paradigm that adapts general large-scale pretrained models (PTMs) to speech emotion recognition task. Although PTMs shed new light on artificial general intelligence, they are constructed with general tasks in mind, and thus, their efficacy for specific tasks can be further improved. Additionally, employing PTMs in practical applications can be challenging due to their considerable size. Above limitations spawn another research direction, namely, optimizing large-scale PTMs for specific tasks to generate task-specific PTMs that are both compact and effective. In this paper, we focus on the speech emotion recognition task and propose an improved emotion-specific pretrained encoder called Vesper. Vesper is pretrained on a speech dataset based on WavLM and takes into account emotional characteristics. To enhance sensitivity to emotional information, Vesper employs an emotion-guided masking strategy to identify the regions that need masking. Subsequently, Vesper employs hierarchical and cross-layer self-supervision to improve its ability to capture acoustic and semantic representations, both of which are crucial for emotion recognition. Experimental results on the IEMOCAP, MELD, and CREMA-D datasets demonstrate that Vesper with 4 layers outperforms WavLM Base with 12 layers, and the performance of Vesper with 12 layers surpasses that of WavLM Large with 24 layers.

arXiv information

Authors	Weidong Chen, Xiaofen Xing, Peihao Chen, Xiangmin Xu
Published	2024-04-18 13:08:07+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
