Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

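Abstract

Talking face generation is the challenging task of synthesizing a natural, realistic face whose lip motion is accurately synchronized with a given audio signal.
Because of co-articulation, in which a phone is influenced by the phones that precede or follow it, the articulation of a phone varies with its phonetic context.
Modeling lip motion with the phonetic context can therefore generate lip movement that is better aligned spatio-temporally.
With this in mind, we investigate the role of phonetic context in generating lip motion for talking face generation.
We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate the lip movement of the target face.
CALS consists of an Audio-to-Lip module and a Lip-to-Face module.
The former is pretrained with masked learning to map each phone to a contextualized lip motion unit.
The contextualized lip motion units then guide the latter in synthesizing the target identity with context-aware lip motion.
Extensive experiments confirm that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment.
We also demonstrate the extent to which phonetic context helps lip synchronization and find that the effective window size for lip generation is approximately 1.2 seconds.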

Abstract (original)

Talking face generation is the challenging task of synthesizing a natural and realistic face that requires accurate synchronization with a given audio. Due to co-articulation, where an isolated phone is influenced by the preceding or following phones, the articulation of a phone varies upon the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS is comprised of an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained based on masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. From extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
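
To make the reported window size concrete, the following minimal sketch converts the roughly 1.2-second effective window into a range of video frame indices around a target frame. The abstract does not state a frame rate or how the window is positioned, so the 25 fps default and the choice of a window centered on the target frame are both assumptions made purely for illustration, and `context_window` is a hypothetical helper, not part of CALS.

```python
def context_window(num_frames, center, window_sec=1.2, fps=25.0):
    """Return (start, end) frame indices, end exclusive, for a context window.

    Assumptions (not from the paper): the window is centered on `center`
    and the video runs at `fps` frames per second. The window is clipped
    at the sequence boundaries.
    """
    if not 0 <= center < num_frames:
        raise ValueError("center must index a frame in the sequence")
    # Number of context frames on each side of the target frame.
    half = int(round(window_sec * fps / 2))
    start = max(0, center - half)
    end = min(num_frames, center + half + 1)
    return start, end


if __name__ == "__main__":
    # Window around frame 50 of a 100-frame clip under the assumed 25 fps.
    print(context_window(100, 50))
```

Under these assumptions the window spans 15 frames on each side of the target frame; a different frame rate or a causal (past-only) window would change the indices but not the idea.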

arXiv information

Authors	Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
Published	2024-01-16 03:26:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
