SPEAK: Speech-Driven Pose and Emotion-Adjustable Talking Head Generation

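Abstract

Most prior work on talking face generation has focused on synchronizing lip motion with speech content. However, head pose and facial emotion are also important characteristics of a natural face. Although audio-driven talking face generation has made notable progress, existing methods either overlook facial emotion or are restricted to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a new one-shot talking head generation framework (SPEAK) that sets itself apart from general talking face generation by enabling control over emotion and pose. Specifically, we introduce an Inter-Reconstructed Feature Disentanglement (IRFD) module to separate facial features into three latent spaces. Next, we design a face editing module that modifies the speech content and facial latent codes into a single latent space. We then present a new generator that uses the modified latent codes obtained from the editing module to adjust emotional expression, head pose, and speech content when synthesizing facial animation. Extensive trials demonstrate that the method generates realistic talking heads with coordinated lip motion, authentic facial emotion, and smooth head movement, ensuring lip synchronization with the audio while enabling decoupled control of facial features. Demo video: https://anonymous.4open.science/r/SPEAK-8A22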

Abstract (original)

Most earlier research on talking face generation has focused on the synchronization of lip motion and speech content. However, head pose and facial emotions are equally important characteristics of natural faces. While audio-driven talking face generation has seen notable advancements, existing methods either overlook facial emotions or are limited to specific individuals and cannot be applied to arbitrary subjects. In this paper, we propose a novel one-shot Talking Head Generation framework (SPEAK) that distinguishes itself from general Talking Face Generation by enabling emotional and postural control. Specifically, we introduce an Inter-Reconstructed Feature Disentanglement (IRFD) module to decouple facial features into three latent spaces. Then we design a face editing module that modifies speech content and facial latent codes into a single latent space. Subsequently, we present a novel generator that employs modified latent codes derived from the editing module to regulate emotional expression, head poses, and speech content in synthesizing facial animations. Extensive trials demonstrate that our method ensures lip synchronization with the audio while enabling decoupled control of facial features, and that it can generate realistic talking heads with coordinated lip motions, authentic facial emotions, and smooth head movements. The demo video is available at: https://anonymous.4open.science/r/SPEAK-8A22
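
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the real IRFD, editing, and generator modules are learned networks, and the generator is omitted here. The names `FaceCodes`, `split_features`, and `edit_latent` are hypothetical, the abstract does not name the three latent spaces (identity, emotion, and pose are assumed here), and merging by concatenation is likewise an assumption made only to show how decoupled codes allow one attribute to be swapped while the others stay fixed.

```python
from dataclasses import dataclass, replace
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class FaceCodes:
    """Three disentangled facial codes (field names are assumptions)."""
    identity: Vector
    emotion: Vector
    pose: Vector


def split_features(features: Vector) -> FaceCodes:
    """Toy stand-in for IRFD: cut one feature vector into three equal parts."""
    if len(features) % 3 != 0:
        raise ValueError("feature length must be divisible by 3")
    n = len(features) // 3
    return FaceCodes(features[:n], features[n:2 * n], features[2 * n:])


def edit_latent(codes: FaceCodes, speech_content: Vector) -> Vector:
    """Toy stand-in for the editing module: merge all codes into one latent.

    Concatenation is an assumption; the abstract only says the codes end up
    in a single latent space.
    """
    return codes.identity + codes.emotion + codes.pose + speech_content


# Codes from the source face and from a separate emotion reference.
source = split_features([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
emotion_ref = split_features([9.0, 9.0, 1.0, 2.0, 9.0, 9.0])

# Swap only the emotion code; identity and pose are left untouched.
controlled = replace(source, emotion=emotion_ref.emotion)

# Hypothetical speech-content code extracted from the driving audio.
speech = [0.7, 0.8]
latent = edit_latent(controlled, speech)
```

In this sketch, `latent` would be the input to a generator; because the codes are held separately until the merge, replacing the emotion code leaves the identity and pose segments of the latent unchanged.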

arXiv information

Authors	Changpeng Cai, Guinan Guo, Jiao Li, Junhao Su, Fei Shen, Chenghao He, Jing Xiao, Yuanxu Chen, Lei Dai, Feiyu Zhu
Published	2024-11-04 16:42:38+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
