Facial Landmark Predictions with Applications to Metaverse

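Summary

This research aims to make metaverse characters more realistic by adding lip animations learned from in-the-wild videos.
To achieve this, our approach extends the Tacotron 2 text-to-speech synthesizer to generate lip movements together with the mel spectrogram in a single pass.
The encoder and gate layer weights are pre-trained on the LJ Speech 1.1 dataset, while the decoder is retrained on 93 clips of TED talk videos extracted from the LRS 3 dataset.
Our new decoder predicts the displacement of 20 lip landmark positions over time, using labels extracted automatically by the OpenFace 2.0 landmark predictor.
Training converged in 7 hours using less than 5 minutes of video.
To demonstrate the effectiveness of transfer learning between audio and visual speech data, we conducted an ablation study on the Pre/Post-Net and the pre-trained encoder weights.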
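
The abstract does not spell out how the displacement targets are defined, so the following is only a minimal sketch under stated assumptions: each video frame carries 20 lip landmarks as 2D (x, y) points, and "displacement" means the frame-to-frame change of each landmark. The paper may define it differently (for example, relative to a reference frame). The function names `to_displacements` and `from_displacements` are hypothetical and are not taken from the paper or from OpenFace.

```python
def to_displacements(frames):
    """Turn T frames of landmarks into T-1 frame-to-frame displacement steps.

    frames: list of frames, each a list of (x, y) tuples (assumed: 20 lip points).
    """
    return [
        [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


def from_displacements(first_frame, steps):
    """Rebuild absolute landmark positions by accumulating steps onto the first frame."""
    frames = [list(first_frame)]
    for step in steps:
        last = frames[-1]
        frames.append([(x + dx, y + dy) for (x, y), (dx, dy) in zip(last, step)])
    return frames


# Synthetic data for illustration only: 3 frames, 20 landmarks each.
NUM_LIP_LANDMARKS = 20  # count taken from the abstract
frames = [
    [(float(i + t), float(2 * i - t)) for i in range(NUM_LIP_LANDMARKS)]
    for t in range(3)
]
steps = to_displacements(frames)
rebuilt = from_displacements(frames[0], steps)
```

With this representation a decoder would emit one set of 20 displacement vectors per output step, and an animation layer could recover absolute lip positions by accumulating them from a known starting pose.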

Abstract (original)

This research aims to make metaverse characters more realistic by adding lip animations learnt from videos in the wild. To achieve this, our approach is to extend Tacotron 2 text-to-speech synthesizer to generate lip movements together with mel spectrogram in one pass. The encoder and gate layer weights are pre-trained on LJ Speech 1.1 data set while the decoder is retrained on 93 clips of TED talk videos extracted from LRS 3 data set. Our novel decoder predicts displacement in 20 lip landmark positions across time, using labels automatically extracted by OpenFace 2.0 landmark predictor. Training converged in 7 hours using less than 5 minutes of video. We conducted ablation study for Pre/Post-Net and pre-trained encoder weights to demonstrate the effectiveness of transfer learning between audio and visual speech data.

arXiv information

Authors	Qiao Han, Jun Zhao, Kwok-Yan Lam
Published	2022-09-29 11:49:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
