DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models

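Summary

Generating stylistic 3D facial animations driven by speech is a significant challenge because it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion.
However, existing methods either employ a deterministic model for the speech-to-motion mapping or encode style with a one-hot encoding scheme.
Notably, the one-hot encoding approach fails to capture the complexity of style, which limits generalization ability.
This paper proposes DiffPoseTalk, a generative framework based on a diffusion model combined with a style encoder that extracts style embeddings from short reference videos.
During inference, classifier-free guidance is used to guide the generation process based on the speech and the style.
In particular, the style extends to the generation of head poses, thereby enhancing user perception.
Additionally, the shortage of scanned 3D talking face data is addressed by training the model on 3DMM parameters reconstructed from a high-quality, in-the-wild audio-visual dataset.
Extensive experiments and a user study demonstrate that the approach outperforms state-of-the-art methods.
The code and dataset are available at https://diffposetalk.github.io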
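
The classifier-free guidance step mentioned in the summary can be illustrated with a minimal sketch. It assumes one common way of combining two conditions (speech and style), which is not necessarily the exact formulation used by the authors. The names `cfg_combine`, `toy_denoiser`, `w_speech`, and `w_style` are hypothetical, and the toy denoiser stands in for a trained diffusion network.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature: (noisy_motion, speech, style) -> prediction.
# Passing None for a condition means "drop that condition".
Denoiser = Callable[
    [Sequence[float], Optional[Sequence[float]], Optional[Sequence[float]]],
    Vector,
]


def cfg_combine(
    denoiser: Denoiser,
    x_t: Sequence[float],
    speech: Sequence[float],
    style: Sequence[float],
    w_speech: float = 1.5,
    w_style: float = 1.5,
) -> Vector:
    """Blend unconditional and conditional predictions (assumed formulation)."""
    uncond = denoiser(x_t, None, None)         # no conditions
    speech_only = denoiser(x_t, speech, None)  # speech condition only
    full = denoiser(x_t, speech, style)        # speech and style conditions
    # Start from the unconditional prediction, then push along the
    # "speech" direction and the "style given speech" direction.
    return [
        u + w_speech * (s - u) + w_style * (f - s)
        for u, s, f in zip(uncond, speech_only, full)
    ]


def toy_denoiser(
    x_t: Sequence[float],
    speech: Optional[Sequence[float]],
    style: Optional[Sequence[float]],
) -> Vector:
    """Toy stand-in for a trained network: each condition adds an offset."""
    out = list(x_t)
    if speech is not None:
        out = [o + a for o, a in zip(out, speech)]
    if style is not None:
        out = [o + b for o, b in zip(out, style)]
    return out


guided = cfg_combine(
    toy_denoiser,
    x_t=[0.0, 1.0],
    speech=[1.0, 1.0],
    style=[2.0, 0.0],
    w_speech=2.0,
    w_style=3.0,
)
```

With both weights set to 1.0 the result reduces to the fully conditioned prediction; larger weights push the output further toward the speech and style conditions.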

Summary (original)

The generation of stylistic 3D facial animations driven by speech presents a significant challenge as it requires learning a many-to-many mapping between speech, style, and the corresponding natural facial motion. However, existing methods either employ a deterministic model for speech-to-motion mapping or encode the style using a one-hot encoding scheme. Notably, the one-hot encoding approach fails to capture the complexity of the style and thus limits generalization ability. In this paper, we propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder that extracts style embeddings from short reference videos. During inference, we employ classifier-free guidance to guide the generation process based on the speech and style. In particular, our style includes the generation of head poses, thereby enhancing user perception. Additionally, we address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset. Extensive experiments and user study demonstrate that our approach outperforms state-of-the-art methods. The code and dataset are at https://diffposetalk.github.io .

arXiv information

Authors	Zhiyao Sun, Tian Lv, Sheng Ye, Matthieu Lin, Jenny Sheng, Yu-Hui Wen, Minjing Yu, Yong-Jin Liu
Published	2024-05-14 13:12:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
