G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment

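Abstract

Despite extensive prior work, generating high-fidelity talking faces whose lip movements are tightly synchronized with arbitrary audio remains a significant challenge, and the shortcomings of published methods are still unresolved. This paper introduces G4G, a generic framework for high-fidelity talking face generation with fine-grained intra-modal alignment. G4G reproduces the high fidelity of the original video while generating highly synchronized lip movements, regardless of the tone or volume of the given audio. The key to G4G is the use of a diagonal matrix to strengthen the ordinary alignment of audio-image intra-modal features, which substantially improves contrastive learning between positive and negative samples. In addition, a multi-scale supervision module is introduced to reproduce the perceptual fidelity of the original video across the whole facial region while emphasizing synchronization between the lip movements and the input audio. A fusion network then blends the facial region with the rest of the frame. Experimental results show strong performance both in reproducing the quality of the original video and in lip synchronization. According to the authors, G4G produces talking videos that are closer to ground-truth quality than those of current state-of-the-art methods.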
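The abstract does not spell out how the diagonal matrix is used, so the following is only a minimal sketch of one common reading: in a batch of paired audio and image embeddings, the similarity matrix is pushed toward a diagonal target, so that matched pairs (the diagonal) act as positives and all other pairs act as negatives. The function names, the temperature value, and the toy embeddings are assumptions made for illustration and are not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(audio, image):
    """Cosine similarity between every audio and every image embedding."""
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in image]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def diagonal_contrastive_loss(sim, temperature=0.1):
    """Row-wise cross-entropy against a diagonal (identity) target.

    Entry (i, i) is treated as the positive pair for row i; every
    off-diagonal entry in that row is treated as a negative.
    The temperature is a hypothetical hyperparameter.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in sim[i]]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / n


# Toy 2-D embeddings: each image vector is close to its paired audio vector.
audio = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Rotating the image batch breaks the pairing, so the diagonal no longer
# holds the most similar entries.
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = diagonal_contrastive_loss(similarity_matrix(audio, aligned))
loss_shuffled = diagonal_contrastive_loss(similarity_matrix(audio, shuffled))
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is small when the paired embeddings sit on the diagonal and much larger when the pairing is broken, which is the behavior a diagonal alignment target is meant to encourage. The actual G4G formulation may differ.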

Abstract (original)

Despite numerous completed studies, achieving high fidelity talking face generation with highly synchronized lip movements corresponding to arbitrary audio remains a significant challenge in the field. The shortcomings of published studies continue to confuse many researchers. This paper introduces G4G, a generic framework for high fidelity talking face generation with fine-grained intra-modal alignment. G4G can reenact the high fidelity of original video while producing highly synchronized lip movements regardless of given audio tones or volumes. The key to G4G’s success is the use of a diagonal matrix to enhance the ordinary alignment of audio-image intra-modal features, which significantly increases the comparative learning between positive and negative samples. Additionally, a multi-scaled supervision module is introduced to comprehensively reenact the perceptional fidelity of original video across the facial region while emphasizing the synchronization of lip movements and the input audio. A fusion network is then used to further fuse the facial region and the rest. Our experimental results demonstrate significant achievements in reenactment of original video quality as well as highly synchronized talking lips. G4G is an outperforming generic framework that can produce talking videos competitively closer to ground truth level than current state-of-the-art methods.

arXiv information

Authors	Juan Zhang, Jiahao Chen, Cheng Wang, Zhiwang Yu, Tangquan Qi, Di Wu
Published	2024-03-02 14:47:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
