Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

Summary

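Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, but achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over long generations remains challenging.
To address these challenges, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which uses both archived-clip and current-clip motion priors to enhance motion prediction and ensure temporal consistency.
The model consists of three key elements: (1) an archived-clip motion prior that incorporates historical frames and a reference frame to preserve identity and context;
(2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate prediction of head movements, lip sync, and expressions;
(3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features.
We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages.
Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation.
Code, models, and datasets will be made publicly available.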
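
The abstract does not say how the memory-efficient temporal attention in element (3) is built, so the following is only a toy sketch of the general idea and not the authors' implementation: a bounded store of past motion features that is updated after every clip and read through scaled dot-product attention. The class name `MotionMemory`, the `capacity` parameter, and the drop-the-oldest update rule are all assumptions made for illustration.

```python
import math
from collections import deque


class MotionMemory:
    """Hypothetical bounded memory of per-clip motion feature vectors."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest entry once it is full,
        # so memory use stays constant however long the video gets.
        self.slots = deque(maxlen=capacity)

    def update(self, feature):
        """Store the motion feature of the clip that was just generated."""
        self.slots.append(list(feature))

    def attend(self, query):
        """Scaled dot-product attention of one query over the stored features."""
        if not self.slots:
            return list(query)  # nothing stored yet: pass the query through
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, slot)) / scale
                  for slot in self.slots]
        peak = max(scores)  # subtract the maximum for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the stored features, one dimension at a time.
        return [sum(w * slot[i] for w, slot in zip(weights, self.slots))
                for i in range(len(query))]


# Usage: read context for each new clip, then store that clip's feature.
memory = MotionMemory(capacity=2)
for clip_feature in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    context = memory.attend(clip_feature)
    memory.update(clip_feature)
```

Capping the store at a fixed size is one simple way to keep the cost of long-term generation constant; the paper's actual update rule may differ.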

Summary (original)

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.

arXiv information

Authors	Fei Shen, Cong Wang, Junyao Gao, Qin Guo, Jisheng Dang, Jinhui Tang, Tat-Seng Chua
Published	2025-02-13 17:50:23+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
