Audio-visual Controlled Video Diffusion with Masked Selective State Spaces Modeling for Natural Talking Head Generation

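Abstract

Talking head synthesis is essential for virtual avatars and human-computer interaction. However, most existing methods accept control from only a single primary modality, which limits their practical usefulness. To address this, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signal and single-signal control for talking head video generation. For multi-signal control, we design a parallel mamba structure with multiple branches, each of which uses a separate driving signal to control a specific facial region. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure that the controlled video is naturally coordinated both temporally and spatially, we adopt the mamba structure, which lets the driving signals manipulate feature tokens across both dimensions in each branch. In addition, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results show that our method generates natural-looking facial videos driven by diverse signals, and that the mamba layer seamlessly integrates multiple driving modalities without conflict. Project website: https://harlanhong.github.io/publications/actalker/index.html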

Abstract (original)

Talking head synthesis is vital for virtual avatars and human-computer interaction. However, most existing methods are typically limited to accepting control from a single primary modality, restricting their practical utility. To this end, we introduce ACTalker, an end-to-end video diffusion framework that supports both multi-signals control and single-signal control for talking head video generation. For multiple control, we design a parallel mamba structure with multiple branches, each utilizing a separate driving signal to control specific facial regions. A gate mechanism is applied across all branches, providing flexible control over video generation. To ensure natural coordination of the controlled video both temporally and spatially, we employ the mamba structure, which enables driving signals to manipulate feature tokens across both dimensions in each branch. Additionally, we introduce a mask-drop strategy that allows each driving signal to independently control its corresponding facial region within the mamba structure, preventing control conflicts. Experimental results demonstrate that our method produces natural-looking facial videos driven by diverse signals and that the mamba layer seamlessly integrates multiple driving modalities without conflict. The project website can be found at https://harlanhong.github.io/publications/actalker/index.html
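
The abstract names three ideas: parallel branches driven by separate signals, a gate on every branch, and a mask-drop step that confines each signal to its own facial region. The toy sketch below shows one way these pieces could fit together. It is not the authors' implementation: every name (`toy_scan`, `masked_branch`, `fuse`), the two-element token vectors, the region masks, and the fixed linear recurrence used in place of a learned Mamba layer are assumptions made purely for illustration, as is the reading of mask-drop as "discard tokens outside the region before the scan".

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Branch = Tuple[Sequence[bool], Vector, float]  # (region mask, driving signal, gate)


def toy_scan(tokens: Sequence[Vector], signal: Vector, decay: float = 0.5) -> List[Vector]:
    """Placeholder for a state-space scan: a fixed linear recurrence over the tokens.

    A real Mamba layer uses learned, input-dependent parameters; this does not.
    """
    state = list(signal)  # assumption: the driving signal seeds the hidden state
    outputs = []
    for tok in tokens:
        state = [decay * s + (1.0 - decay) * t for s, t in zip(state, tok)]
        outputs.append(list(state))
    return outputs


def masked_branch(tokens: Sequence[Vector], mask: Sequence[bool],
                  signal: Vector, gate: float) -> List[Vector]:
    """Return a gated residual that is non-zero only where `mask` is True."""
    kept = [i for i, m in enumerate(mask) if m]  # drop tokens outside the region
    scanned = toy_scan([tokens[i] for i in kept], signal)
    delta = [[0.0] * len(tok) for tok in tokens]
    for i, out in zip(kept, scanned):
        delta[i] = [gate * (o - t) for o, t in zip(out, tokens[i])]
    return delta


def fuse(tokens: Sequence[Vector], branches: Sequence[Branch]) -> List[Vector]:
    """Add every branch's gated residual to the input tokens."""
    result = [list(tok) for tok in tokens]
    for mask, signal, gate in branches:
        delta = masked_branch(tokens, mask, signal, gate)
        for i, d in enumerate(delta):
            result[i] = [r + x for r, x in zip(result[i], d)]
    return result


# Hypothetical example: four tokens, the first two in a "mouth" region.
tokens = [[0.0, 0.0] for _ in range(4)]
mouth_mask = [True, True, False, False]
upper_mask = [False, False, True, True]
audio_signal = [1.0, 1.0]
expression_signal = [-1.0, -1.0]

# Gate 0.0 switches the expression branch off, leaving audio-only control.
audio_only = fuse(tokens, [(mouth_mask, audio_signal, 1.0),
                           (upper_mask, expression_signal, 0.0)])
print(audio_only)
```

Because the two masks are disjoint in this example, each branch writes to its own tokens and the residuals never overlap, which is the sense in which a region mask can prevent control conflicts. Setting a gate to zero removes that branch's contribution entirely, so the same code path covers single-signal control.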

arXiv information

Authors	Fa-Ting Hong, Zunnan Xu, Zixiang Zhou, Jun Zhou, Xiu Li, Qin Lin, Qinglin Lu, Dan Xu
Published	2025-04-04 06:51:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
