HCAM — Hierarchical Cross Attention Model for Multi-modal Emotion Recognition

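Summary

Title: HCAM – Hierarchical Cross-Attention Model for Multi-modal Emotion Recognition

Summary:
– Emotion recognition in conversations is difficult because emotional expression is multi-modal.
– The paper proposes HCAM, a hierarchical cross-attention model that performs multi-modal emotion recognition by combining recurrent and co-attention neural network models.
– The model takes two input modalities: i) audio data processed through a learnable wav2vec approach, and ii) text data represented using a bidirectional encoder representations from transformers (BERT) model.
– The audio and text representations are processed by a set of bi-directional recurrent neural network layers with self-attention, which convert each utterance into a fixed-dimensional embedding.
– To incorporate contextual information and information across the two modalities, the audio and text embeddings are combined using a co-attention layer that weighs the utterance-level embeddings by their relevance to the emotion recognition task.
– The neural network parameters of the audio layers, text layers, and multi-modal co-attention layers are trained hierarchically for the emotion classification task.
– Experiments on three established datasets (IEMOCAP, MELD, and CMU-MOSI) show that the proposed model improves significantly over other benchmarks and achieves state-of-the-art results on all of them.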
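
The step that turns a variable-length utterance into a fixed-dimensional embedding can be illustrated with attention pooling. The following is a minimal sketch, not the authors' implementation: it assumes the recurrent layers have already produced one vector per frame, and it scores each frame against a single vector `query`, which stands in for learned attention parameters. The names `attention_pool`, `query`, and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Collapse T frame vectors of size D into one vector of size D.

    frames: list of T vectors (assumed to be recurrent-layer outputs).
    query:  hypothetical scoring vector of size D (learned in a real model).
    Returns the pooled vector and the attention weights over frames.
    """
    # One scalar relevance score per frame (dot product with the query).
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    dim = len(frames[0])
    # Weighted average of the frames, one dimension at a time.
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames))
              for d in range(dim)]
    return pooled, weights


# Two utterances of different lengths map to embeddings of the same size.
short_utt = [[1.0, 0.0], [0.0, 1.0]]
long_utt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
query = [0.3, -0.2]
emb_short, w_short = attention_pool(short_utt, query)
emb_long, w_long = attention_pool(long_utt, query)
```

Because the weights sum to one, the output size depends only on the frame dimension and not on the utterance length, which is what lets a later layer treat every utterance in a conversation uniformly.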

Abstract (original)

Emotion recognition in conversations is challenging due to the multi-modal nature of the emotion expression. We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition using a combination of recurrent and co-attention neural network models. The input to the model consists of two modalities, i) audio data, processed through a learnable wav2vec approach and, ii) text data represented using a bidirectional encoder representations from transformers (BERT) model. The audio and text representations are processed using a set of bi-directional recurrent neural network layers with self-attention that converts each utterance in a given conversation to a fixed dimensional embedding. In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer that attempts to weigh the utterance level embeddings relevant to the task of emotion recognition. The neural network parameters in the audio layers, text layers as well as the multi-modal co-attention layers, are hierarchically trained for the emotion classification task. We perform experiments on three established datasets namely, IEMOCAP, MELD and CMU-MOSI, where we illustrate that the proposed model improves significantly over other benchmarks and helps achieve state-of-art results on all these datasets.
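
The idea of combining the two modalities through co-attention can be sketched with scaled dot-product attention in which queries come from one modality and keys and values from the other. This is an illustrative sketch under simplifying assumptions, not the model from the paper: it omits learned projection matrices, multiple heads, and the classifier, and it fuses the two attended views by plain concatenation. The function names and toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """For each query vector, return a weighted average of keys_values.

    queries:     utterance embeddings from one modality.
    keys_values: utterance embeddings from the other modality, used as
                 both keys and values (no learned projections here).
    """
    scale = math.sqrt(len(queries[0]))
    value_dim = len(keys_values[0])
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale
                  for k in keys_values]
        weights = softmax(scores)
        attended.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                         for d in range(value_dim)])
    return attended


def co_attention_fuse(audio, text):
    """Fuse per-utterance audio and text embeddings of one conversation."""
    audio_view = cross_attend(audio, text)   # audio queries attend to text
    text_view = cross_attend(text, audio)    # text queries attend to audio
    # Concatenate the two views per utterance (an assumed fusion choice).
    return [a + t for a, t in zip(audio_view, text_view)]


# A toy conversation with three utterances and 2-dimensional embeddings.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = co_attention_fuse(audio, text)
```

Each fused vector mixes information from every utterance of the other modality, so the representation of one utterance carries conversational context as well as cross-modal cues.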

arXiv information

Authors	Soumya Dutta, Sriram Ganapathy
Published	2023-04-14 03:25:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
