OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Summary

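Lip synchronization aligns a speaker's lip movements in a video with the corresponding speech audio, and it is essential for producing realistic, expressive video content.
Existing methods, however, often rely on reference frames and masked-frame inpainting, which limits their robustness with respect to identity consistency, pose variation, facial occlusion, and stylized content.
In addition, because audio provides a weaker conditioning signal than visual cues, lip shapes leaking from the original video degrade lip-sync quality.
This paper presents OmniSync, a universal lip synchronization framework for diverse visual scenarios.
The approach introduces a mask-free training paradigm that uses Diffusion Transformer models to edit frames directly, without explicit masks. This enables inference of unlimited duration while maintaining natural facial dynamics and preserving character identity.
At inference time, the authors propose a flow-matching-based progressive noise initialization that keeps pose and identity consistent while still allowing precise editing of the mouth region.
To address the weak conditioning signal of audio, they develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space.
They also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos.
Extensive experiments show that OmniSync significantly outperforms prior methods in both visual quality and lip-sync accuracy, achieving superior results on both real-world and AI-generated videos.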

Abstract (original)

Lip synchronization is the task of aligning a speaker’s lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames and masked-frame inpainting, which limit their robustness to identity consistency, pose variations, facial occlusions, and stylized content. In addition, since audio signals provide weaker conditioning than visual cues, lip shape leakage from the original video will affect lip sync quality. In this paper, we present OmniSync, a universal lip synchronization framework for diverse visual scenarios. Our approach introduces a mask-free training paradigm using Diffusion Transformer models for direct frame editing without explicit masks, enabling unlimited-duration inference while maintaining natural facial dynamics and preserving character identity. During inference, we propose a flow-matching-based progressive noise initialization to ensure pose and identity consistency, while allowing precise mouth-region editing. To address the weak conditioning signal of audio, we develop a Dynamic Spatiotemporal Classifier-Free Guidance (DS-CFG) mechanism that adaptively adjusts guidance strength over time and space. We also establish the AIGC-LipSync Benchmark, the first evaluation suite for lip synchronization in diverse AI-generated videos. Extensive experiments demonstrate that OmniSync significantly outperforms prior methods in both visual quality and lip sync accuracy, achieving superior results in both real-world and AI-generated videos.
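
The abstract describes DS-CFG only as guidance whose strength varies over time and space. The following is a minimal sketch of that general idea, not the authors' implementation. It applies the standard classifier-free guidance combination with a per-pixel, per-timestep scale. The Gaussian weighting around a mouth center, the linear schedule and its direction, and all names and default values (`spatial_weight`, `temporal_scale`, `guided_prediction`, `sigma`, `w_max`, `w_min`) are assumptions made for illustration.

```python
import math


def spatial_weight(height, width, center, sigma):
    """Gaussian bump in (0, 1] that peaks at `center` (row, col).

    Assumption: guidance should be strongest near the mouth and fade
    with distance from it.
    """
    cy, cx = center
    return [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def temporal_scale(t, w_max, w_min):
    """Linear schedule: t=1.0 (pure noise) gives w_max, t=0.0 gives w_min.

    Assumption: the direction and shape of the schedule are illustrative.
    """
    return w_min + (w_max - w_min) * t


def guided_prediction(uncond, cond, t, center, sigma=2.0, w_max=3.0, w_min=1.0):
    """Combine unconditional and audio-conditioned predictions per pixel.

    Uses the usual classifier-free guidance form
        out = uncond + scale * (cond - uncond)
    where `scale` depends on both the pixel position and the timestep t.
    """
    height, width = len(uncond), len(uncond[0])
    mask = spatial_weight(height, width, center, sigma)
    w_t = temporal_scale(t, w_max, w_min)
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            # Far from the center the scale approaches 1.0 (plain conditional).
            scale = 1.0 + (w_t - 1.0) * mask[y][x]
            row.append(uncond[y][x] + scale * (cond[y][x] - uncond[y][x]))
        out.append(row)
    return out
```

With this weighting, pixels far from the assumed mouth center receive a scale close to 1.0, which is the plain conditional prediction, while pixels at the center receive the full timestep-dependent scale.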

arXiv information

Authors	Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He
Published	2025-05-27 17:20:38+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
