Versatile audio-visual learning for emotion recognition

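Summary

Most current audio-visual emotion recognition models lack the flexibility required for deployment in real-world applications.
We envision a multimodal system that still works when only one modality is available, and that can be used interchangeably either to predict emotional attributes or to recognize categorical emotions.
Achieving this flexibility in a multimodal emotion recognition system is difficult because accurately interpreting and integrating varied data sources poses inherent challenges.
It is also challenging to handle missing or partial information robustly while allowing a direct switch between regression and classification tasks.
This study proposes the versatile audio-visual learning (VAVL) framework, which handles unimodal and multimodal systems for emotion regression or emotion classification tasks.
We implement an audio-visual framework that can be trained even when paired audio and visual data is unavailable for part of the training set (i.e., only audio or only video is present).
We achieve this effective representation learning with audio-visual shared layers, residual connections over the shared layers, and a unimodal reconstruction task.
Our experimental results show that the architecture significantly outperforms strong baselines on the CREMA-D, MSP-IMPROV, and CMU-MOSEI corpora.
Notably, VAVL attains new state-of-the-art performance on the emotional attribute prediction task on the MSP-IMPROV corpus.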

Summary (original)

Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented interchangeably for either predicting emotional attributes or recognizing categorical emotions. Achieving such flexibility in a multimodal emotion recognition system is difficult due to the inherent challenges in accurately interpreting and integrating varied data sources. It is also a challenge to robustly handle missing or partial information while allowing direct switch between regression or classification tasks. This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems for emotion regression or emotion classification tasks. We implement an audio-visual framework that can be trained even when audio and visual paired data is not available for part of the training set (i.e., audio only or only video is present). We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task. Our experimental results reveal that our architecture significantly outperforms strong baselines on the CREMA-D, MSP-IMPROV, and CMU-MOSEI corpora. Notably, VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus.
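
The three ingredients named in the abstract (shared layers, residual connections over them, and a unimodal reconstruction task) can be illustrated with a small toy model. The sketch below is not the authors' implementation: the class name `ToyVAVL`, the single dense layer per stage, the layer sizes, and the averaging of modality representations are all assumptions made for illustration. It only shows how one set of shared weights can serve whichever modalities are present, and how each modality contributes its own reconstruction loss.

```python
import random


def make_linear(n_in, n_out, rng):
    """Return (weights, bias) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(layer, x):
    """Apply a dense layer to a feature vector."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


class ToyVAVL:
    """Hypothetical toy model; names and sizes are illustrative assumptions."""

    def __init__(self, audio_dim, visual_dim, hidden=8, n_out=3, seed=0):
        rng = random.Random(seed)
        # One encoder and one decoder per modality.
        self.encoders = {"audio": make_linear(audio_dim, hidden, rng),
                         "visual": make_linear(visual_dim, hidden, rng)}
        self.decoders = {"audio": make_linear(hidden, audio_dim, rng),
                         "visual": make_linear(hidden, visual_dim, rng)}
        # A single shared layer whose weights are used by both modalities.
        self.shared = make_linear(hidden, hidden, rng)
        # Output head: read as attribute scores (regression) or class logits.
        self.head = make_linear(hidden, n_out, rng)

    def forward(self, inputs):
        """inputs maps 'audio' and/or 'visual' to a feature vector."""
        if not inputs:
            raise ValueError("at least one modality is required")
        reps, recon_loss = [], 0.0
        for name, x in inputs.items():
            h = relu(linear(self.encoders[name], x))
            s = relu(linear(self.shared, h))
            z = [a + b for a, b in zip(h, s)]  # residual connection over the shared layer
            # Unimodal reconstruction: rebuild this modality's own input from z.
            recon_loss += mse(linear(self.decoders[name], z), x)
            reps.append(z)
        # Average the representations of whichever modalities are present.
        fused = [sum(col) / len(reps) for col in zip(*reps)]
        return linear(self.head, fused), recon_loss


model = ToyVAVL(audio_dim=4, visual_dim=6)
audio_only, audio_loss = model.forward({"audio": [0.1, 0.2, 0.3, 0.4]})
both, both_loss = model.forward({"audio": [0.1, 0.2, 0.3, 0.4], "visual": [0.5] * 6})
```

Because the forward pass iterates only over the modalities it is given, the same model accepts audio-only, video-only, or paired inputs. In a training loop, the reconstruction loss would be added to the task loss (regression or classification) for each sample.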

arXiv information

Authors	Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso
Published	2024-07-30 14:36:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
