Plug the Leaks: Advancing Audio-driven Talking Face Generation by Preventing Unintended Information Flow

Summary

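Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames.
It involves two major challenges: the overall visual quality of the generated images, and the audio-visual synchronization of the mouth region.
This paper begins by identifying several problems with the synchronization methods used in recent audio-driven talking face generation approaches.
Specifically, these are the unintended flow of lip and pose information from the reference to the generated image, and instabilities during model training.
The authors then propose several techniques to avoid these problems. First, a silent-lip reference image generator prevents lips from leaking from the reference into the generated image.
Second, an adaptive triplet loss handles the pose leaking problem.
Finally, they propose a stabilized formulation of the synchronization loss that avoids the training instabilities mentioned above while further alleviating the lip leaking problem.
Combining the individual improvements yields state-of-the-art performance on LRS2 and LRW in both synchronization and visual quality.
The design is further validated in various ablation experiments, which confirm the individual contributions as well as their complementary effects.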

Summary (original)

Audio-driven talking face generation is the task of creating a lip-synchronized, realistic face video from given audio and reference frames. This involves two major challenges: overall visual quality of generated images on the one hand, and audio-visual synchronization of the mouth part on the other hand. In this paper, we start by identifying several problematic aspects of synchronization methods in recent audio-driven talking face generation approaches. Specifically, this involves unintended flow of lip and pose information from the reference to the generated image, as well as instabilities during model training. Subsequently, we propose various techniques for obviating these issues: First, a silent-lip reference image generator prevents leaking of lips from the reference to the generated image. Second, an adaptive triplet loss handles the pose leaking problem. Finally, we propose a stabilized formulation of synchronization loss, circumventing the aforementioned training instabilities while further alleviating the lip leaking issue. Combining the individual improvements, we present state-of-the-art performance on LRS2 and LRW in both synchronization and visual quality. We further validate our design in various ablation experiments, confirming the individual contributions as well as their complementary effects.
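
For readers unfamiliar with the term, the following is a minimal sketch of the standard (non-adaptive) triplet margin loss on which a triplet-style objective is typically built. The abstract does not specify how the authors' adaptive variant works, so this is not their formulation; the function names, the Euclidean distance, and the margin value are all assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (illustrative, not the paper's adaptive loss).

    Returns zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise returns the size of the violation.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero when the negative is already far enough from the anchor, so only triplets that violate the margin contribute to training.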

arXiv information

Authors	Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel
Published	2023-07-18 15:50:04+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
