Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

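Summary

Recent advances in diffusion models have revolutionized audio-driven talking head synthesis.
Beyond accurate lip synchronization, diffusion-based methods excel at generating subtle expressions and natural head movements that align well with the audio signal.
However, these methods suffer from slow inference, insufficient fine-grained control over facial motion, and occasional visual artifacts that stem largely from implicit latent spaces derived from Variational Auto-Encoders (VAE), all of which hinder their adoption in realtime interactive applications.
To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis.
Our key innovation is to replace conventional VAE representations and bridge motion generation and photorealistic neural rendering through an explicit, identity-agnostic motion space.
This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads.
In addition, we propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis.
This optimization enables streaming processing, realtime inference, and low first-frame delay, which are crucial capabilities for interactive applications such as AI assistants.
Extensive experimental results show that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.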

Summary (original)

Recent advances in diffusion models have revolutionized audio-driven talking head synthesis. Beyond precise lip synchronization, diffusion-based methods excel in generating subtle expressions and natural head movements that are well-aligned with the audio signal. However, these methods are confronted by slow inference speed, insufficient fine-grained control over facial motions, and occasional visual artifacts largely due to an implicit latent space derived from Variational Auto-Encoders (VAE), which prevent their adoption in realtime interaction applications. To address these issues, we introduce Ditto, a diffusion-based framework that enables controllable realtime talking head synthesis. Our key innovation lies in bridging motion generation and photorealistic neural rendering through an explicit identity-agnostic motion space, replacing conventional VAE representations. This design substantially reduces the complexity of diffusion learning while enabling precise control over the synthesized talking heads. We further propose an inference strategy that jointly optimizes three key components: audio feature extraction, motion generation, and video synthesis. This optimization enables streaming processing, realtime inference, and low first-frame delay, which are the functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and substantially outperforms existing methods in both motion control and realtime performance.

arXiv information

Authors	Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Published	2024-12-23 14:04:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
