Dual-Stage Cross-Modal Network with Dynamic Feature Fusion for Emotional Mimicry Intensity Estimation

Summary

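Emotional mimicry intensity (EMI) estimation is a key technology for understanding human social behavior and improving human-computer interaction. Its core challenge lies in modeling dynamic correlations among multimodal temporal signals and fusing them robustly.
To address the limitations of existing methods, namely insufficient exploitation of cross-modal synergy, sensitivity to noise, and limited fine-grained alignment capability, the paper proposes a dual-stage cross-modal alignment framework.
First, vision-text and audio-text contrastive learning networks are built on an improved CLIP architecture, achieving preliminary alignment in feature space through modality-decoupled pre-training.
Next, a temporal-aware dynamic fusion module combines a Temporal Convolutional Network (TCN) with a gated bidirectional LSTM to capture, respectively, the macro-evolution patterns of facial expressions and the local dynamics of acoustic features.
In addition, a quality-guided modality fusion strategy uses differentiable weight allocation to enable modality compensation under occlusion and noise.
Experiments on the Hume-Vidmimic2 dataset show that the method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%.
Ablation studies further validate the effectiveness of the dual-stage training strategy and the dynamic fusion mechanism, providing a new technical pathway for fine-grained emotion analysis in open environments.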
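
The quality-guided fusion idea can be illustrated with a small sketch. The abstract does not say how modality quality is scored or how the weights are normalized, so everything here is an assumption: the function names `softmax` and `quality_guided_fusion` are hypothetical, the quality scores are taken as given, and a softmax is used only because it is a common differentiable way to turn scores into weights that sum to one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def quality_guided_fusion(features, quality_scores):
    """Fuse per-modality feature vectors with quality-derived weights.

    features: list of equal-length feature vectors, one per modality.
    quality_scores: one scalar per modality (assumed to come from some
        upstream quality estimator that the abstract does not describe).
    Returns (fused_vector, weights).
    """
    weights = softmax(quality_scores)
    dim = len(features[0])
    fused = [
        sum(w * f[i] for w, f in zip(weights, features))
        for i in range(dim)
    ]
    return fused, weights


if __name__ == "__main__":
    visual = [1.0, 0.0]   # hypothetical visual embedding
    audio = [0.0, 1.0]    # hypothetical audio embedding
    # A low visual score (e.g. an occluded face) shifts weight to audio.
    fused, weights = quality_guided_fusion([visual, audio], [-5.0, 5.0])
    print(fused, weights)
```

Because the weights are a smooth function of the quality scores, gradients can flow back into whatever produces those scores, which is one plausible reading of "differentiable weight allocation".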

Abstract (original)

Emotional Mimicry Intensity (EMI) estimation serves as a critical technology for understanding human social behavior and enhancing human-computer interaction experiences, where the core challenge lies in dynamic correlation modeling and robust fusion of multimodal temporal signals. To address the limitations of existing methods in insufficient exploitation of modal synergistic effects, noise sensitivity, and limited fine-grained alignment capabilities, this paper proposes a dual-stage cross-modal alignment framework. First, we construct vision-text and audio-text contrastive learning networks based on an improved CLIP architecture, achieving preliminary alignment in the feature space through modality-decoupled pre-training. Subsequently, we design a temporal-aware dynamic fusion module that combines Temporal Convolutional Networks (TCN) and gated bidirectional LSTM to respectively capture the macro-evolution patterns of facial expressions and local dynamics of acoustic features. Innovatively, we introduce a quality-guided modality fusion strategy that enables modality compensation under occlusion and noisy scenarios through differentiable weight allocation. Experimental results on the Hume-Vidmimic2 dataset demonstrate that our method achieves an average Pearson correlation coefficient of 0.35 across six emotion dimensions, outperforming the best baseline by 40%. Ablation studies further validate the effectiveness of the dual-stage training strategy and dynamic fusion mechanism, providing a novel technical pathway for fine-grained emotion analysis in open environments.
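
The headline metric, an average Pearson correlation coefficient across emotion dimensions, can be computed as in the following minimal sketch. The function names `pearson` and `mean_pearson` and the list-of-lists data layout are assumptions made for illustration; the abstract does not describe the evaluation code.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant series; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(preds, targets):
    """Average per-dimension Pearson correlation.

    preds, targets: lists of samples, each sample a list holding one
    intensity value per emotion dimension (six in the paper's setting).
    """
    dims = len(preds[0])
    scores = [
        pearson([p[d] for p in preds], [t[d] for t in targets])
        for d in range(dims)
    ]
    return sum(scores) / dims


if __name__ == "__main__":
    # Toy data with two dimensions instead of six, for brevity.
    preds = [[0.1, 0.9], [0.4, 0.5], [0.8, 0.2]]
    targets = [[0.2, 0.8], [0.5, 0.6], [0.9, 0.1]]
    print(mean_pearson(preds, targets))
```

Averaging per-dimension correlations, rather than correlating the flattened arrays, keeps a dimension with a wide value range from dominating the score.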

arXiv information

Authors	Jun Yu, Lingsi Zhu, Yanjun Chi, Yunxiang Zhang, Yang Zheng, Yongqi Wang, Xilong Lu
Published	2025-03-14 09:55:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
