Towards Expressive Video Dubbing with Multiscale Multimodal Context Interaction

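Summary

Automatic Video Dubbing (AVD) generates speech from a script, aligned with lip motion and facial emotion.
Recent work focuses on modeling multimodal context to improve prosodic expressiveness, but it overlooks two key issues: 1) multiscale prosody expression attributes in the context influence the prosody of the current sentence;
2) prosody cues in the context interact with the current sentence and affect the final prosodic expressiveness.
To address these issues, the authors propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD.
The scheme includes two shared M2CI encoders that model the multiscale multimodal context and facilitate its deep interaction with the current sentence.
The approach extracts global and local features for each modality in the context, uses attention-based mechanisms for aggregation and interaction, and employs an interaction-based graph attention network for fusion, thereby enhancing the prosodic expressiveness of the speech synthesized for the current sentence.
Experiments on the Chem dataset show that the model outperforms baselines in dubbing expressiveness.
The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.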

Abstract (original)

Automatic Video Dubbing (AVD) generates speech aligned with lip motion and facial emotion from scripts. Recent research focuses on modeling multimodal context to enhance prosody expressiveness but overlooks two key issues: 1) Multiscale prosody expression attributes in the context influence the current sentence’s prosody. 2) Prosody cues in context interact with the current sentence, impacting the final prosody expressiveness. To tackle these challenges, we propose M2CI-Dubber, a Multiscale Multimodal Context Interaction scheme for AVD. This scheme includes two shared M2CI encoders to model the multiscale multimodal context and facilitate its deep interaction with the current sentence. By extracting global and local features for each modality in the context, utilizing attention-based mechanisms for aggregation and interaction, and employing an interaction-based graph attention network for fusion, the proposed approach enhances the prosody expressiveness of synthesized speech for the current sentence. Experiments on the Chem dataset show our model outperforms baselines in dubbing expressiveness. The code and demos are available at https://github.com/AI-S2-Lab/M2CI-Dubber.
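
The attention-based aggregation mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows generic scaled dot-product attention, in which a feature vector for the current sentence acts as the query over local (per-utterance) context features, alongside a mean-pooled global context feature. The function names, the two-dimensional toy vectors, and the mean pooling are all assumptions made for illustration; the graph attention fusion step is not covered.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, context):
    """Scaled dot-product attention of one query vector over context vectors.

    Returns the attention-weighted sum of the context vectors and the weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * c for q, c in zip(query, vec)) / scale for vec in context]
    weights = softmax(scores)
    dim = len(context[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, context)) for i in range(dim)]
    return pooled, weights


# Hypothetical toy features (not taken from the paper).
current_sentence = [1.0, 0.0]
context_local = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Assumed "global" feature: the mean of the local context vectors.
context_global = [sum(col) / len(context_local) for col in zip(*context_local)]

# Local context aggregated with the current sentence as the query.
pooled, weights = attend(current_sentence, context_local)

# One simple way to combine the two scales: concatenation.
context_summary = pooled + context_global
```

In this toy setup the context vector most similar to the current sentence receives the largest weight, which is the behavior attention-based aggregation relies on.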

arXiv information

Authors	Yuan Zhao, Rui Liu, Gaoxiang Cong
Published	2024-12-31 07:27:26+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
