PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis

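Abstract

Talking head synthesis from arbitrary speech audio is an important challenge in the field of digital humans.
Recently, radiance-field-based methods have drawn increasing attention because they can synthesize high-fidelity, identity-consistent talking heads from only a few minutes of training video.
However, because the training data is limited in scale, these methods often perform poorly on audio-lip synchronization and visual quality.
This paper proposes PointTalk, a new 3D Gaussian-based method that builds a static 3D Gaussian field of the head and deforms it in sync with the audio.
It also incorporates an audio-driven dynamic lip point cloud as a key component of the conditional information, which supports effective talking head synthesis.
Specifically, the first step generates the corresponding lip point cloud from the audio signal and captures its topological structure.
A dynamic difference encoder is designed to capture the subtle nuances of dynamic lip movements more effectively.
In addition, an audio-point enhancement module is integrated; it not only ensures that the audio signal and the corresponding lip point cloud are synchronized in feature space, but also promotes a deeper understanding of the relationships among cross-modal conditional features.
Extensive experiments show that the method achieves higher fidelity and better audio-lip synchronization in talking head synthesis than previous methods.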

Abstract (original)

Talking head synthesis with arbitrary speech audio is a crucial challenge in the field of digital humans. Recently, methods based on radiance fields have received increasing attention due to their ability to synthesize high-fidelity and identity-consistent talking heads from just a few minutes of training video. However, due to the limited scale of the training data, these methods often exhibit poor performance in audio-lip synchronization and visual quality. In this paper, we propose a novel 3D Gaussian-based method called PointTalk, which constructs a static 3D Gaussian field of the head and deforms it in sync with the audio. It also incorporates an audio-driven dynamic lip point cloud as a critical component of the conditional information, thereby facilitating the effective synthesis of talking heads. Specifically, the initial step involves generating the corresponding lip point cloud from the audio signal and capturing its topological structure. The design of the dynamic difference encoder aims to capture the subtle nuances inherent in dynamic lip movements more effectively. Furthermore, we integrate the audio-point enhancement module, which not only ensures the synchronization of the audio signal with the corresponding lip point cloud within the feature space, but also facilitates a deeper understanding of the interrelations among cross-modal conditional features. Extensive experiments demonstrate that our method achieves superior high-fidelity and audio-lip synchronization in talking head synthesis compared to previous methods.
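The abstract does not say how the dynamic difference encoder is built. As a rough illustration only, one plausible input to such an encoder is the frame-to-frame displacement of each lip point. The sketch below computes that displacement for a sequence of lip point clouds; the function name `frame_differences`, the `Point` and `Frame` aliases, and the choice of zero displacement for the first frame are all assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

# Hypothetical types for illustration: a point is (x, y, z),
# a frame is the list of lip points at one time step.
Point = Tuple[float, float, float]
Frame = List[Point]


def frame_differences(frames: List[Frame]) -> List[Frame]:
    """Per-point displacement between consecutive frames.

    Entry t of the result holds frames[t] - frames[t - 1]. The first
    frame has no predecessor, so its displacements are set to zero
    (an assumption of this sketch).
    """
    if not frames:
        return []
    n = len(frames[0])
    if any(len(frame) != n for frame in frames):
        raise ValueError("every frame must contain the same number of points")

    diffs: List[Frame] = [[(0.0, 0.0, 0.0)] * n]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([
            (cx - px, cy - py, cz - pz)
            for (px, py, pz), (cx, cy, cz) in zip(prev, cur)
        ])
    return diffs
```

A learned encoder would presumably consume these displacements together with the point positions, but that step depends on architectural details the abstract does not give.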

arXiv information

Authors	Yifan Xie, Tao Feng, Xin Zhang, Xiangyang Luo, Zixuan Guo, Weijiang Yu, Heng Chang, Fei Ma, Fei Richard Yu
Published	2024-12-11 16:15:14+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
