M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation

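Abstract

Emotion Recognition in Conversations (ERC) is crucial for developing empathetic human-machine interaction.
In conversational videos, emotion can be present in multiple modalities: audio, video, and transcript.
However, because of the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging task.
Existing ERC research focuses mainly on the textual information in a conversation and ignores the other two modalities.
The authors anticipate that a multi-modal approach can improve emotion recognition accuracy.
This study therefore proposes a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
A new feature extractor is introduced to extract latent features from the audio and visual modalities.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function so that it learns emotion-relevant features from the audio and visual data.
In the ERC domain, existing methods perform well on one benchmark dataset but not on others.
The results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on the well-known MELD and IEMOCAP datasets, setting a new state of the art in ERC.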
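The abstract reports results as a weighted average F1 score. As a reminder of what that metric measures, here is a minimal sketch in plain Python: it computes a per-class F1 and weights each class by its share of the true labels. The function name `weighted_f1` and the toy emotion labels are illustrative assumptions, not taken from the paper or its evaluation code.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Weighted-average F1: per-class F1, weighted by class support in y_true.

    Illustrative helper (hypothetical name), not the paper's evaluation script.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1  # weight by the class's share of samples
    return score


# Toy example with made-up labels (not MELD or IEMOCAP data).
y_true = ["joy", "joy", "joy", "anger", "neutral", "neutral"]
y_pred = ["joy", "joy", "neutral", "anger", "neutral", "joy"]
print(round(weighted_f1(y_true, y_pred), 4))
```

Because each class contributes in proportion to its support, frequent classes dominate the score, which matters for conversational emotion datasets where class frequencies are typically uneven.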

Abstract (original)

Emotion Recognition in Conversations (ERC) is crucial in developing sympathetic human-machine interaction. In conversational videos, emotion can be present in multiple modalities, i.e., audio, video, and transcript. However, due to the inherent characteristics of these modalities, multi-modal ERC has always been considered a challenging undertaking. Existing ERC research focuses mainly on using text information in a discussion, ignoring the other two modalities. We anticipate that emotion recognition accuracy can be improved by employing a multi-modal approach. Thus, in this study, we propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality. It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data. We introduce a new feature extractor to extract latent features from the audio and visual modality. The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data. In the domain of ERC, the existing methods perform well on one benchmark dataset but not on others. Our results show that the proposed M2FNet architecture outperforms all other methods in terms of weighted average F1 score on well-known MELD and IEMOCAP datasets and sets a new state-of-the-art performance in ERC.

arXiv information

Authors	Vishal Chudasama, Purbayan Kar, Ashish Gudmalwar, Nirmesh Shah, Pankaj Wasnik, Naoyuki Onoe
Published	2022-06-05 14:18:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
