Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis

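Summary

Recent advances in diffusion models have given talking head synthesis subtle expressions and vivid head movements, but they have also brought slow inference and insufficient control over the generated results.
To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained control and real-time inference.
Specifically, we use an off-the-shelf motion extractor and devise a diffusion transformer that generates representations in a specific motion space.
We optimize the model architecture and training strategy to address problems in generating motion representations, including insufficient disentanglement between motion and identity and large internal discrepancies within the representation.
We also employ diverse conditional signals and establish a mapping between the motion representation and facial semantics, which allows control over the generation process and correction of the results.
In addition, we jointly optimize the whole framework to enable streaming processing, real-time inference, and low first-frame delay, capabilities that are crucial for interactive applications such as AI assistants.
Extensive experiments show that Ditto generates compelling talking head videos and is superior in both controllability and real-time performance.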

Summary (original)

Recent advances in diffusion models have endowed talking head synthesis with subtle expressions and vivid head movements, but have also led to slow inference speed and insufficient control over generated results. To address these issues, we propose Ditto, a diffusion-based talking head framework that enables fine-grained controls and real-time inference. Specifically, we utilize an off-the-shelf motion extractor and devise a diffusion transformer to generate representations in a specific motion space. We optimize the model architecture and training strategy to address the issues in generating motion representations, including insufficient disentanglement between motion and identity, and large internal discrepancies within the representation. Besides, we employ diverse conditional signals while establishing a mapping between motion representation and facial semantics, enabling control over the generation process and correction of the results. Moreover, we jointly optimize the holistic framework to enable streaming processing, real-time inference, and low first-frame delay, offering functionalities crucial for interactive applications such as AI assistants. Extensive experimental results demonstrate that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.

arXiv information

Authors	Tianqi Li, Ruobing Zheng, Minghui Yang, Jingdong Chen, Ming Yang
Published	2025-04-30 09:42:00+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
