HCAM — Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

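Summary

Emotion recognition in conversations is difficult because emotions are expressed across multiple modalities.
We propose the hierarchical cross-attention model (HCAM), an approach to multi-modal emotion recognition that combines recurrent and co-attention neural network models.
The model takes two modalities as input: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a bidirectional encoder representations from transformers (BERT) model.
The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention, which convert each utterance in a given conversation into a fixed-dimensional embedding.
To incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance-level embeddings relevant to the emotion recognition task.
The neural network parameters of the audio layers, text layers, and multi-modal co-attention layers are trained hierarchically for the emotion classification task.
Experiments on three established datasets, IEMOCAP, MELD, and CMU-MOSI, show that the proposed model improves significantly over other benchmarks and helps achieve state-of-the-art results on all of these datasets.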
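The summary describes self-attention that turns each utterance, whatever its length, into a fixed-dimensional embedding. The following is a minimal sketch of that idea only, not the authors' implementation: it assumes dot-product scoring against a single query vector, and the names `attention_pool`, `frames`, and `query` are hypothetical. In the paper the frame vectors would come from the bi-directional recurrent layers and the query would be learned; here both are small hand-written stand-ins.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse a variable-length list of frame vectors into one vector.

    frames: list of equal-length float lists (stand-ins for RNN outputs).
    query:  float list of the same length (assumed learnable in practice).
    Returns the pooled vector and the attention weights.
    """
    # Score each frame by its dot product with the query.
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted sum over frames, one output value per dimension.
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return pooled, weights


# Two utterances of different lengths map to embeddings of the same size.
short_utt = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
long_utt = [[0.2, 0.1, 0.0], [0.9, 0.3, 0.1], [0.4, 0.4, 0.4], [0.0, 0.0, 1.0]]
query = [0.5, -0.5, 1.0]

pooled_short, w_short = attention_pool(short_utt, query)
pooled_long, w_long = attention_pool(long_utt, query)
print(len(pooled_short), len(pooled_long))
```

Because the weights always sum to one, the pooled vector stays in the same range as the frames regardless of how many frames an utterance contains.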

Summary (original)

Emotion recognition in conversations is challenging due to the multi-modal nature of the emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities, i) audio data, processed through a learnable wav2vec approach and, ii) text data represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed using a set of bi-directional recurrent neural network layers with self-attention that converts each utterance in a given conversation to a fixed dimensional embedding. In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, text layers as well as the multi-modal co-attention layers, are hierarchically trained for the emotion classification task. We perform experiments on three established datasets namely, IEMOCAP, MELD and CMU-MOSI, where we illustrate that the proposed model improves significantly over other benchmarks and helps achieve state-of-art results on all these datasets.
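The abstract says the audio and text utterance embeddings are combined by a co-attention layer, without giving its exact form. One common way to realize co-attention is to let each modality attend over the other and then join the two results. The sketch below illustrates that reading under stated assumptions: scaled dot-product scoring, no learned projections, and fusion by concatenation are all choices made here for illustration, and `cross_attend` and `co_attention` are hypothetical names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """For each query vector, return a weighted mix of the other modality.

    queries, keys_values: lists of equal-length float lists, one vector
    per utterance in the conversation.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score against every utterance of the other modality.
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended


def co_attention(audio, text):
    """Fuse per-utterance audio and text embeddings (assumed: concatenation)."""
    audio_ctx = cross_attend(audio, text)   # audio queries attend over text
    text_ctx = cross_attend(text, audio)    # text queries attend over audio
    return [a + t for a, t in zip(audio_ctx, text_ctx)]  # list concatenation


# A toy conversation with three utterances and 2-dimensional embeddings.
audio_emb = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
text_emb = [[0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]

fused = co_attention(audio_emb, text_emb)
print(len(fused), len(fused[0]))
```

Each fused vector could then feed an emotion classifier for the corresponding utterance; that classifier, and any learned weights, are outside the scope of this sketch.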

arXiv information

Authors	Soumya Dutta, Sriram Ganapathy
Published	2024-01-09 11:45:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
