GMTalker: Gaussian Mixture-based Audio-Driven Emotional talking video Portraits

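Abstract

Synthesizing high-fidelity, emotion-controllable talking video portraits with audio-lip synchronization, vivid expressions, realistic head poses, and eye blinks has become an important and challenging task in recent years.
Most existing methods struggle to achieve personalized, precise emotion control, smooth transitions between emotional states, and diverse motion generation.
To address these challenges, we present GMTalker, a Gaussian mixture-based framework for generating emotional talking portraits.
Specifically, we propose a Gaussian mixture-based expression generator that constructs a continuous, disentangled latent space, enabling more flexible emotion manipulation.
Furthermore, we introduce a normalizing flow-based motion generator, pretrained on a large dataset covering a wide range of motion, to generate diverse head poses, blinks, and eyeball movements.
Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that synthesizes high-fidelity, faithful emotional video portraits.
Both quantitative and qualitative experiments demonstrate that our method outperforms previous methods in image quality, photorealism, emotion accuracy, and motion diversity.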

Abstract (original)

Synthesizing high-fidelity and emotion-controllable talking video portraits, with audio-lip sync, vivid expressions, realistic head poses, and eye blinks, has been an important and challenging task in recent years. Most existing methods suffer in achieving personalized and precise emotion control, smooth transitions between different emotion states, and the generation of diverse motions. To tackle these challenges, we present GMTalker, a Gaussian mixture-based emotional talking portraits generation framework. Specifically, we propose a Gaussian mixture-based expression generator that can construct a continuous and disentangled latent space, achieving more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator pretrained on a large dataset with a wide-range motion to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that can synthesize high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.

arXiv information

Authors	Yibo Xia, Lizhen Wang, Xiang Deng, Xiaoyan Luo, Yebin Liu
Published	2024-05-28 17:01:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google