Diff2Lip: Audio Conditioned Diffusion Models for Lip-Synchronization

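Abstract

The task of lip synchronization (lip-sync) aims to match the lips of a human face to different audio.
It has applications in the film industry as well as in creating virtual avatars and in video conferencing.
The problem is challenging because detailed, realistic lip movements must be introduced while the identity, pose, emotions, and image quality are preserved.
Many previous methods that attempt to solve it suffer from degraded image quality because they lack complete contextual information.
This paper presents Diff2Lip, an audio-conditioned diffusion-based model that can perform lip synchronization in the wild while preserving these qualities.
The model is trained on Voxceleb2, a video dataset of in-the-wild talking face videos.
Extensive studies show that the method outperforms popular methods such as Wav2Lip and PC-AVS on the Fréchet inception distance (FID) metric and on users' Mean Opinion Scores (MOS).
Results are shown for both the reconstruction setting (same audio-video inputs) and the cross setting (different audio-video inputs) on the Voxceleb2 and LRW datasets.
Video results and code can be accessed from the project page ( https://soumik-kanad.github.io/diff2lip ).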

Abstract (original)

The task of lip synchronization (lip-sync) seeks to match the lips of human faces with different audio. It has various applications in the film industry as well as for creating virtual avatars and for video conferencing. This is a challenging problem as one needs to simultaneously introduce detailed, realistic lip movements while preserving the identity, pose, emotions, and image quality. Many of the previous methods trying to solve this problem suffer from image quality degradation due to a lack of complete contextual information. In this paper, we present Diff2Lip, an audio-conditioned diffusion-based model which is able to do lip synchronization in-the-wild while preserving these qualities. We train our model on Voxceleb2, a video dataset containing in-the-wild talking face videos. Extensive studies show that our method outperforms popular methods like Wav2Lip and PC-AVS in Fréchet inception distance (FID) metric and Mean Opinion Scores (MOS) of the users. We show results on both reconstruction (same audio-video inputs) as well as cross (different audio-video inputs) settings on Voxceleb2 and LRW datasets. Video results and code can be accessed from our project page ( https://soumik-kanad.github.io/diff2lip ).

arXiv information

Authors	Soumik Mukhopadhyay, Saksham Suri, Ravi Teja Gadde, Abhinav Shrivastava
Published	2023-08-18 17:59:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
