Dimitra: Audio-driven Diffusion model for Expressive Talking Head Generation

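Summary

We propose Dimitra, a new framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, and head pose motion.
Specifically, we train a conditional Motion Diffusion Transformer (cMDT) by modeling facial motion sequences with a 3D representation.
The cMDT is conditioned on only two input signals: an audio sequence and a reference facial image.
By extracting additional features directly from the audio, Dimitra improves the quality and realism of the generated videos.
In particular, phoneme sequences contribute to the realism of lip motion, while the text transcript contributes to the realism of facial expression and head pose.
Quantitative and qualitative experiments on two widely used datasets, VoxCeleb2 and HDTF, show that Dimitra outperforms existing approaches and generates realistic talking heads with lip motion, facial expression, and head pose.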

Abstract (original)

We propose Dimitra, a novel framework for audio-driven talking head generation, streamlined to learn lip motion, facial expression, as well as head pose motion. Specifically, we train a conditional Motion Diffusion Transformer (cMDT) by modeling facial motion sequences with 3D representation. We condition the cMDT with only two input signals, an audio-sequence, as well as a reference facial image. By extracting additional features directly from audio, Dimitra is able to increase quality and realism of generated videos. In particular, phoneme sequences contribute to the realism of lip motion, whereas text transcript to facial expression and head pose realism. Quantitative and qualitative experiments on two widely employed datasets, VoxCeleb2 and HDTF, showcase that Dimitra is able to outperform existing approaches for generating realistic talking heads imparting lip motion, facial expression, and head pose.

arXiv information

Authors	Baptiste Chopin, Tashvik Dhamija, Pranav Balaji, Yaohui Wang, Antitza Dantcheva
Published	2025-02-24 14:31:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
