Joint Multimodal Transformer for Emotion Recognition in the Wild

Summary

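Multimodal emotion recognition (MMER) systems can typically outperform unimodal systems by exploiting the inter- and intra-modal relationships among modalities such as visual, textual, physiological, and auditory signals.
This paper proposes an MMER method that relies on a joint multimodal transformer for fusion with key-based cross-attention.
The framework aims to exploit the diverse and complementary nature of the different modalities to improve predictive accuracy.
Separate backbones capture the intra-modal spatiotemporal dependencies within each modality over video sequences.
A joint multimodal transformer fusion architecture then integrates the individual modality embeddings, allowing the model to capture inter- and intra-modal relationships effectively.
Extensive experiments on two challenging expression recognition tasks, (1) dimensional emotion recognition on the Affwild2 dataset (using face and voice) and (2) pain estimation on the Biovid dataset (using face and biosensors), indicate that the proposed method can work effectively with different modalities.
Empirical results show that MMER systems with the proposed fusion method can outperform relevant baseline and state-of-the-art methods.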

Abstract (original)

Systems for multimodal emotion recognition (MMER) can typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. In this paper, an MMER method is proposed that relies on a joint multimodal transformer for fusion with key-based cross-attention. This framework aims to exploit the diverse and complementary nature of different modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, a joint multimodal transformer fusion architecture integrates the individual modality embeddings, allowing the model to capture inter-modal and intra-modal relationships effectively. Extensive experiments on two challenging expression recognition tasks: (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice), and (2) pain estimation on the Biovid dataset (with face and biosensors), indicate that the proposed method can work effectively with different modalities. Empirical results show that MMER systems with our proposed fusion method allow us to outperform relevant baseline and state-of-the-art methods.
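The abstract does not spell out the fusion architecture, so the following is only a minimal sketch of the general idea of cross-attention between two modality embeddings, not the authors' joint multimodal transformer. It assumes plain scaled dot-product attention with no learned projections, and the names `cross_attention` and `fuse` and the toy `video` and `audio` values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query frame attends over
    the frames (keys/values) of the other modality."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def fuse(video, audio):
    """Assumed fusion rule: concatenate, per frame, the video-to-audio
    and audio-to-video attended features."""
    v2a = cross_attention(video, audio, audio)
    a2v = cross_attention(audio, video, video)
    return [v + a for v, a in zip(v2a, a2v)]


# Toy per-frame embeddings: 3 frames, 2 features per modality (made-up values).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
fused = fuse(video, audio)
```

Here `fused` holds one 4-dimensional vector per frame. A real system would presumably add learned query, key, and value projections and feed the backbone embeddings through transformer layers, which this sketch leaves out.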

arXiv information

Authors	Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger
Published	2024-04-02 15:34:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
