Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

Summary

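Previous studies have explored how to generate accurately lip-synced talking faces for arbitrary targets given an audio condition.
However, most of them deform or generate the entire facial region, which leads to unrealistic results.
In this work, we delve into the formulation of altering only the mouth shape of the target person.
This requires masking a large portion of the original image and seamlessly inpainting it using audio and reference frames.
To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes.
Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities using carefully designed Transformers.
Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling in the masked parts.
The fusion attends uniformly to the texture information in the unmasked regions and in the reference frame.
Semantic audio information is then used to enhance the self-attention computation.
In addition, a refinement network with audio injection improves both image quality and lip-sync quality.
Extensive experiments validate that the model can generate high-fidelity lip-synced results for arbitrary subjects.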

Summary (original)

Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to exploit desired contextual information provided in audio and visual modalities thoroughly with delicately designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information on the unmasked regions and the reference frame. Then the semantic audio information is involved in enhancing the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
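
To make the attention-based fusion idea more concrete, here is a minimal sketch in Python with NumPy. It is not the AV-CAT implementation, and the abstract does not specify these details. The function name `fuse_masked_tokens`, the token shapes, the identity query/key/value projections, and the choice to add a projected audio feature to the queries are all assumptions made for illustration. The sketch only shows one plausible reading of "uniformly attends": tokens from the unmasked regions and from the reference frame are placed in a single context set, and each masked-region token attends over that set in one softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_masked_tokens(masked, unmasked, reference, audio, w_audio):
    """Single-head attention from masked tokens to a joint visual context.

    masked:    (n_masked, d)   tokens for the masked mouth region
    unmasked:  (n_unmasked, d) tokens from the visible part of the frame
    reference: (n_ref, d)      tokens from the reference frame
    audio:     (d_audio,)      one audio feature vector (hypothetical)
    w_audio:   (d_audio, d)    projection of audio into token space (hypothetical)
    """
    d = masked.shape[-1]
    # Assumption: audio conditions the queries by simple addition.
    queries = masked + audio @ w_audio
    # Unmasked and reference tokens share one key/value set, so neither
    # source is treated differently by the attention itself.
    context = np.concatenate([unmasked, reference], axis=0)
    scores = queries @ context.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    # Identity value projection: the output is a weighted mix of context tokens.
    return weights @ context, weights

rng = np.random.default_rng(0)
masked = rng.normal(size=(4, 8))
unmasked = rng.normal(size=(6, 8))
reference = rng.normal(size=(10, 8))
audio = rng.normal(size=(5,))
w_audio = rng.normal(size=(5, 8))

out, w = fuse_masked_tokens(masked, unmasked, reference, audio, w_audio)
print(out.shape, w.shape)  # (4, 8) (4, 16)
```

A real model would use learned query, key, and value projections, multiple heads, and positional information. Those are omitted here to keep the joint-context idea visible.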

arXiv information

Authors	Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong, Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike
Published	2022-12-09 16:32:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
