Speech Swin-Transformer: Exploring a Hierarchical Transformer with Shifted Windows for Speech Emotion Recognition

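Summary

The Swin-Transformer has achieved remarkable success in computer vision by exploiting hierarchical feature representations built on the Transformer.
In speech signals, emotional information is distributed across features at different scales, such as words, phrases, and utterances.
Inspired by this, the paper presents Speech Swin-Transformer, a hierarchical speech Transformer with shifted windows that aggregates multi-scale emotion features for speech emotion recognition (SER).
Specifically, the speech spectrogram is first divided along the time axis into segment-level patches, each composed of multiple frame patches.
These segment-level patches are encoded by a stack of Swin blocks, in which a local window Transformer explores local inter-frame emotional information across the frame patches of each segment patch.
A shifted window Transformer is then designed to compensate for patch correlations near the boundaries of segment patches.
Finally, a patch merging operation aggregates segment-level emotional features into a hierarchical speech representation by expanding the Transformer's receptive field from the frame level to the segment level.
Experimental results show that the proposed Speech Swin-Transformer outperforms state-of-the-art methods.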
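
The windowing idea in the summary can be illustrated with a small sketch. It is not the authors' implementation: the function names, the window size of 4, and the half-window cyclic shift (borrowed from the original vision Swin-Transformer) are assumptions made for illustration. Frame patches are represented only by their time indices, so the output shows which frames would share an attention window.

```python
from typing import List


def partition_windows(frames: List[int], window: int) -> List[List[int]]:
    """Split a frame sequence into non-overlapping windows (segment patches)."""
    if len(frames) % window != 0:
        raise ValueError("number of frames must be a multiple of the window size")
    return [frames[i:i + window] for i in range(0, len(frames), window)]


def shifted_windows(frames: List[int], window: int) -> List[List[int]]:
    """Cyclically shift the sequence by half a window, then partition it.

    Assumption: a half-window cyclic shift along time, as in the vision Swin.
    """
    shift = window // 2
    rolled = frames[shift:] + frames[:shift]
    return partition_windows(rolled, window)


frames = list(range(8))  # 8 frame patches, indexed along the time axis
regular = partition_windows(frames, window=4)
shifted = shifted_windows(frames, window=4)

print(regular)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(shifted)  # [[2, 3, 4, 5], [6, 7, 0, 1]]
```

In the regular partition, frames 3 and 4 never share a window even though they are adjacent in time. After the shift they fall into the same window, which is the boundary compensation the summary describes. The last shifted window wraps around (frames 6, 7, 0, 1); the vision Swin-Transformer masks attention between such non-adjacent frames, and this sketch leaves that masking out.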

Abstract (original)

Swin-Transformer has demonstrated remarkable success in computer vision by leveraging its hierarchical feature representation based on Transformer. In speech signals, emotional information is distributed across different scales of speech features, e.g., word, phrase, and utterance. Drawing above inspiration, this paper presents a hierarchical speech Transformer with shifted windows to aggregate multi-scale emotion features for speech emotion recognition (SER), called Speech Swin-Transformer. Specifically, we first divide the speech spectrogram into segment-level patches in the time domain, composed of multiple frame patches. These segment-level patches are then encoded using a stack of Swin blocks, in which a local window Transformer is utilized to explore local inter-frame emotional information across frame patches of each segment patch. After that, we also design a shifted window Transformer to compensate for patch correlations near the boundaries of segment patches. Finally, we employ a patch merging operation to aggregate segment-level emotional features for hierarchical speech representation by expanding the receptive field of Transformer from frame-level to segment-level. Experimental results demonstrate that our proposed Speech Swin-Transformer outperforms the state-of-the-art methods.
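
The patch merging step mentioned in the abstract can be sketched in a similarly hedged way. The abstract does not specify the merge factor or the projection that follows, so this example assumes that pairs of time-adjacent tokens are concatenated (factor 2), and it omits the learned linear projection that the vision Swin-Transformer applies after concatenation. The name `merge_patches` and the toy feature values are hypothetical.

```python
from typing import List

Vector = List[float]


def merge_patches(tokens: List[Vector], factor: int = 2) -> List[Vector]:
    """Concatenate every `factor` consecutive token vectors into one token.

    The sequence gets `factor` times shorter and each token `factor` times
    wider, so later attention windows cover a longer stretch of time.
    """
    if len(tokens) % factor != 0:
        raise ValueError("number of tokens must be a multiple of the merge factor")
    merged = []
    for i in range(0, len(tokens), factor):
        group = tokens[i:i + factor]
        merged.append([value for vector in group for value in vector])
    return merged


# Toy input: 8 frame-level tokens, each a 2-dimensional feature vector.
tokens = [[float(t), float(t) + 0.5] for t in range(8)]
stage1 = merge_patches(tokens)  # 4 tokens of dimension 4
stage2 = merge_patches(stage1)  # 2 tokens of dimension 8

print(len(tokens), len(tokens[0]))  # 8 2
print(len(stage1), len(stage1[0]))  # 4 4
print(len(stage2), len(stage2[0]))  # 2 8
```

Each merge halves the number of tokens, so a window of fixed size in a later stage spans twice as many original frames. This is one way to read the abstract's statement that the receptive field grows from frame level to segment level.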

arXiv information

Authors	Yong Wang, Cheng Lu, Hailun Lian, Yan Zhao, Björn Schuller, Yuan Zong, Wenming Zheng
Published	2024-01-19 07:30:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
