Interactive Multimodal Fusion with Temporal Modeling

Summary

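This paper presents a method for valence-arousal (VA) estimation in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition.
Our approach integrates visual and audio information through a multimodal framework.
The visual branch uses a pre-trained ResNet model to extract spatial features from facial images.
The audio branches use pre-trained VGG models to extract VGGish and LogMel features from speech signals.
These features then undergo temporal modeling with Temporal Convolutional Networks (TCNs).
Next, cross-modal attention mechanisms are applied, in which visual features interact with audio features through query-key-value attention structures.
Finally, the features are concatenated and passed through a regression layer to predict valence and arousal.
Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for in-the-wild VA estimation.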
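
The temporal modeling step can be illustrated with a small sketch. The summary only says that TCNs are used; the causal, dilated convolution below is the common TCN building block and is an assumption here, as are the function names, kernel size, dilation schedule, and ReLU activation. It is meant as an illustration in NumPy, not the authors' implementation.

```python
import numpy as np

def causal_dilated_conv1d(x, weights, dilation=1):
    """Causal dilated 1D convolution over time.

    x:       (T, C_in) sequence of per-frame features
    weights: (K, C_in, C_out) kernel; tap K-1 sees the current frame,
             tap k looks back (K-1-k) * dilation frames
    returns: (T, C_out)
    """
    T, c_in = x.shape
    K, _, c_out = weights.shape
    pad = (K - 1) * dilation
    # Left-pad with zeros so output[t] never depends on frames after t.
    x_padded = np.concatenate([np.zeros((pad, c_in)), x], axis=0)
    out = np.zeros((T, c_out))
    for k in range(K):
        start = k * dilation
        out += x_padded[start:start + T] @ weights[k]
    return out

def tcn_stack(x, weight_list):
    """Hypothetical TCN stack: dilation doubles per layer (1, 2, 4, ...)."""
    h = x
    for i, w in enumerate(weight_list):
        h = np.maximum(causal_dilated_conv1d(h, w, dilation=2 ** i), 0.0)  # ReLU
    return h

# Usage with random stand-in features (10 frames, 4 channels).
rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 4))
layers = [rng.normal(size=(3, 4, 8)), rng.normal(size=(3, 8, 8))]
temporal_features = tcn_stack(frames, layers)  # shape (10, 8)
```

Doubling the dilation per layer lets the receptive field grow exponentially with depth, which is the usual motivation for TCNs on long frame sequences.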

Summary (original)

This paper presents our method for the estimation of valence-arousal (VA) in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information through a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images. The audio branches employ pre-trained VGG models to extract VGGish and LogMel features from speech signals. These features undergo temporal modeling using Temporal Convolutional Networks (TCNs). We then apply cross-modal attention mechanisms, where visual features interact with audio features through query-key-value attention structures. Finally, the features are concatenated and passed through a regression layer to predict valence and arousal. Our method achieves competitive performance on the Aff-Wild2 dataset, demonstrating effective multimodal fusion for VA estimation in-the-wild.
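
The fusion stage described in the abstract can be sketched as follows: visual features act as queries, each audio stream supplies keys and values, and the attended outputs are concatenated with the visual features before a linear regression layer. This is a minimal single-head NumPy sketch under stated assumptions. The feature dimensions, the projection matrices, the `params` layout, and the `tanh` output squashing (chosen because Aff-Wild2 annotates valence and arousal in [-1, 1]) are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(visual, audio, w_q, w_k, w_v):
    """Visual frames query audio frames (scaled dot-product attention).

    visual: (T, d_vis), audio: (T, d_aud)
    w_q: (d_vis, d), w_k and w_v: (d_aud, d)
    returns: attended audio features (T, d) and attention weights (T, T)
    """
    q = visual @ w_q
    k = audio @ w_k
    v = audio @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn

def predict_va(visual, vggish, logmel, params):
    """Hypothetical fusion head: attend, concatenate, regress to (valence, arousal)."""
    att_vggish, _ = cross_modal_attention(visual, vggish, *params["vggish"])
    att_logmel, _ = cross_modal_attention(visual, logmel, *params["logmel"])
    fused = np.concatenate([visual, att_vggish, att_logmel], axis=-1)
    # tanh keeps predictions in [-1, 1]; this bounding is an assumption.
    return np.tanh(fused @ params["w_out"] + params["b_out"])

# Usage with random stand-in features for 6 frames.
rng = np.random.default_rng(0)
T, d_vis, d_vgg, d_mel, d = 6, 16, 8, 12, 16
params = {
    "vggish": [rng.normal(size=s) * 0.1 for s in [(d_vis, d), (d_vgg, d), (d_vgg, d)]],
    "logmel": [rng.normal(size=s) * 0.1 for s in [(d_vis, d), (d_mel, d), (d_mel, d)]],
    "w_out": rng.normal(size=(d_vis + 2 * d, 2)) * 0.1,
    "b_out": np.zeros(2),
}
va = predict_va(rng.normal(size=(T, d_vis)), rng.normal(size=(T, d_vgg)),
                rng.normal(size=(T, d_mel)), params)  # shape (6, 2)
```

In a full model the inputs would be the TCN outputs of each branch rather than random arrays, and the projections would be learned.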

arXiv information

Authors	Jun Yu, Yongqi Wang, Lei Wang, Yang Zheng, Shengfan Xu
Published	2025-03-13 16:31:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
