EmotiveTalk: Expressive Talking Head Generation through Audio Information Decoupling and Emotional Video Diffusion

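Summary

Diffusion models have revolutionized talking head generation, but they still face challenges in expressiveness, controllability, and stability over long generations.
This work proposes the EmotiveTalk framework to address these problems.
First, to better control the generation of lip movements and facial expressions, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expressions.
Specifically, to align the audio and facial expression representation spaces, V-AID includes a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module, which generates expression-related representations under multi-source emotion condition constraints.
Next, an Emotional Talking Head Diffusion (ETHD) backbone is proposed to efficiently generate highly expressive talking head videos. It contains an Expression Decoupling Injection (EDI) module that automatically decouples expressions from the reference portrait while integrating the target expression information, which yields more expressive generation.
Experimental results show that EmotiveTalk generates expressive talking head videos, delivers the promised emotion controllability and stability during long generation, and achieves state-of-the-art performance compared with existing methods.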

Summary (original)

Diffusion models have revolutionized the field of talking head generation, yet still face challenges in expressiveness, controllability, and stability in long-time generation. In this research, we propose an EmotiveTalk framework to address these issues. Firstly, to realize better control over the generation of lip movement and facial expression, a Vision-guided Audio Information Decoupling (V-AID) approach is designed to generate audio-based decoupled representations aligned with lip movements and expression. Specifically, to achieve alignment between audio and facial expression representation spaces, we present a Diffusion-based Co-speech Temporal Expansion (Di-CTE) module within V-AID to generate expression-related representations under multi-source emotion condition constraints. Then we propose a well-designed Emotional Talking Head Diffusion (ETHD) backbone to efficiently generate highly expressive talking head videos, which contains an Expression Decoupling Injection (EDI) module to automatically decouple the expressions from reference portraits while integrating the target expression information, achieving more expressive generation performance. Experimental results show that EmotiveTalk can generate expressive talking head videos, ensuring the promised controllability of emotions and stability during long-time generation, yielding state-of-the-art performance compared to existing methods.

arXiv information

Authors	Haotian Wang, Yuzhe Weng, Yueyan Li, Zilu Guo, Jun Du, Shutong Niu, Jiefeng Ma, Shan He, Xiaoyan Wu, Qiming Hu, Bing Yin, Cong Liu, Qingfeng Liu
Published	2024-12-16 17:11:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
