Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

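Summary

Title: Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

Summary:

– Multimodal-driven talking face generation refers to animating a portrait with a given pose, expression, and gaze, which are either transferred from a driving image or video or estimated from text or audio.
– However, existing methods overlook the potential of the text modality, and their generators mostly follow a source-oriented feature-rearrangement paradigm coupled with unstable GAN frameworks.
– This work first represents emotion in a text prompt, which inherits rich semantics from CLIP and enables flexible, generalized emotion control.
– It then recasts these tasks as target-oriented texture transfer and adopts diffusion models.
– Specifically, given a textured face as the source and a rendered face projected from the desired 3DMM coefficients as the target, the proposed Texture-Geometry-aware Diffusion Model (TGDM) decomposes the complex transfer problem into a multi-conditional denoising process.
– Within that process, a Texture Attention-based module accurately models the correspondences between the appearance and geometry cues contained in the source and target conditions, and incorporates extra implicit information for high-fidelity talking face generation.
– TGDM can also be adapted to face swapping. The resulting paradigm is free of unstable seesaw-style optimization, which gives simple, stable, and effective training and inference schemes.
– The authors report that extensive experiments demonstrate the superiority of their method.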
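
The summary does not describe the internals of the Texture Attention-based module. As a rough illustration only, the sketch below uses generic scaled dot-product cross-attention, which is an assumption and not the paper's actual design: target geometry features act as queries, and source features act as keys and values, so each target location pulls texture from the source location whose geometry it matches best. The function names and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (not the paper's module).

    queries: one feature vector per target location
    keys:    one feature vector per source location (same width as queries)
    values:  one texture vector per source location
    Returns the attended texture per target location and the attention weights.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        attended = [
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ]
        outputs.append(attended)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy data: 2 target locations, 3 source locations.
target_geometry = [[4.0, 0.0], [0.0, 4.0]]              # queries
source_geometry = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]  # keys
source_texture = [[1.0, 0.0, 0.0],                      # values
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]]

transferred, weights = cross_attention(
    target_geometry, source_geometry, source_texture
)
```

In this toy setup each target location attends almost entirely to the source location with the matching geometry vector, so its output is close to that location's texture. A real model would learn the query, key, and value projections instead of using raw features.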

Abstract (original)

Multimodal-driven talking face generation refers to animating a portrait with the given pose, expression, and gaze transferred from the driving image and video, or estimated from the text and audio. However, existing methods ignore the potential of text modal, and their generators mainly follow the source-oriented feature rearrange paradigm coupled with unstable GAN frameworks. In this work, we first represent the emotion in the text prompt, which could inherit rich semantics from the CLIP, allowing flexible and generalized emotion control. We further reorganize these tasks as the target-oriented texture transfer and adopt the Diffusion Models. More specifically, given a textured face as the source and the rendered face projected from the desired 3DMM coefficients as the target, our proposed Texture-Geometry-aware Diffusion Model decomposes the complex transfer problem into multi-conditional denoising process, where a Texture Attention-based module accurately models the correspondences between appearance and geometry cues contained in source and target conditions, and incorporate extra implicit information for high-fidelity talking face generation. Additionally, TGDM can be gracefully tailored for face swapping. We derive a novel paradigm free of unstable seesaw-style optimization, resulting in simple, stable, and effective training and inference schemes. Extensive experiments demonstrate the superiority of our method.
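
The abstract does not specify the noise schedule or the parameterization behind the multi-conditional denoising process. The sketch below therefore assumes the standard DDPM formulation with a linear beta schedule, purely for illustration. It shows the forward noising step and how a noise prediction that receives both conditions (source texture and target geometry) yields an estimate of the clean sample. `oracle_denoiser` is a hypothetical stand-in for a trained network: it returns the true noise so that the arithmetic can be checked.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def predict_x0(x_t, eps_hat, alpha_bar):
    """Invert the forward process given a noise estimate eps_hat."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(x_t, eps_hat)]


def oracle_denoiser(x_t, t, source_texture, target_geometry, true_eps):
    """Hypothetical placeholder for a trained conditional noise predictor.

    A real model would compute its output from x_t, t, and both conditions;
    this stub ignores them and returns the true noise.
    """
    return true_eps


random.seed(0)
alpha_bars = linear_alpha_bars(1000)
t = 500

x0 = [0.2, -0.5, 0.9]              # toy "clean image" values
source_texture = [0.1, 0.1, 0.1]   # toy condition: textured source face
target_geometry = [0.0, 1.0, 0.0]  # toy condition: rendered target face
eps = [random.gauss(0.0, 1.0) for _ in x0]

x_t = add_noise(x0, eps, alpha_bars[t])
eps_hat = oracle_denoiser(x_t, t, source_texture, target_geometry, eps)
x0_hat = predict_x0(x_t, eps_hat, alpha_bars[t])
```

With a perfect noise estimate, `x0_hat` matches `x0` up to floating-point error. In training, the network's estimate would be regressed toward `eps`, which is the single-objective setup that contrasts with adversarial GAN optimization.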

arXiv information

Authors	Chao Xu, Shaoting Zhu, Junwei Zhu, Tianxin Huang, Jiangning Zhang, Ying Tai, Yong Liu
Published	2023-05-09 12:01:14+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
