DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation

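Summary

【Title】
Crafting diffusion models for generalized audio-driven portrait animation (DiffTalk)

【Summary】
– Talking head synthesis is a promising approach for the video production industry
– Much effort has gone into improving either generation quality or model generalization, but few works improve both at once
– This paper therefore uses Latent Diffusion Models to model talking head generation as an audio-driven, temporally coherent denoising process
– Specifically, rather than using the audio signal as the sole driving factor, the authors investigate the control mechanism of the talking face and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis
– The proposed DiffTalk produces high-quality talking head videos synchronized with the source audio, and it generalizes naturally across identities without further fine-tuning
– DiffTalk can also be adapted to higher-resolution synthesis at negligible extra computational cost
– Extensive experiments show that DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities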
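
As a rough illustration of the idea of a conditioned denoising process, the following is a minimal sketch of generic DDPM-style ancestral sampling in which audio, reference-image, and landmark features are concatenated into one condition. It is not the DiffTalk implementation: the function names, feature sizes, linear noise schedule, and the `toy_noise_predictor` stand-in for a trained network are all assumptions made for this example, and the temporal-coherence mechanism is not modeled.

```python
import math
import random


def build_condition(audio_feat, ref_feat, landmark_feat):
    """Concatenate the three conditioning signals into a single vector."""
    return list(audio_feat) + list(ref_feat) + list(landmark_feat)


def toy_noise_predictor(z_t, t, cond):
    """Hypothetical stand-in for a trained noise-prediction network.

    A real model would be a neural network; this toy version only mixes
    the latent with the mean of the condition so the loop can run.
    """
    bias = sum(cond) / len(cond)
    return [0.1 * z + 0.01 * bias for z in z_t]


def make_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption) with cumulative alpha products."""
    betas = [
        beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        for i in range(num_steps)
    ]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars


def reverse_step(z_t, t, cond, schedule, rng):
    """One DDPM reverse step: subtract predicted noise, then re-add noise."""
    betas, alphas, alpha_bars = schedule
    eps = toy_noise_predictor(z_t, t, cond)
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = [(z - coef * e) / math.sqrt(alphas[t]) for z, e in zip(z_t, eps)]
    if t == 0:
        return mean  # the final step adds no noise
    sigma = math.sqrt(betas[t])
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


def sample_latent(cond, dim=4, num_steps=50, seed=0):
    """Start from Gaussian noise and denoise step by step under `cond`."""
    rng = random.Random(seed)
    schedule = make_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(num_steps)):
        z = reverse_step(z, t, cond, schedule, rng)
    return z


if __name__ == "__main__":
    # Made-up feature vectors, only to exercise the loop.
    cond = build_condition([0.2, -0.1], [0.5, 0.5, 0.5], [0.1, 0.9])
    print(sample_latent(cond))
```

In a latent diffusion setup the sampled latent would then be decoded to an image by a separately trained decoder, which is omitted here.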

Abstract (original)

Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted in this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/.

arXiv information

Authors	Shuai Shen, Wenliang Zhao, Zibin Meng, Wanhua Li, Zheng Zhu, Jie Zhou, Jiwen Lu
Published	2023-04-20 08:51:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
