StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing

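Summary

Given a script, the task of movie dubbing (Visual Voice Cloning, V2C) is to generate speech that matches the video in both timing and emotion, using the tone of a reference audio track.
Existing state-of-the-art V2C models split the phonemes in the script at the boundaries between video frames. This solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability.
To address this, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level.
It has three main components: (1) a multimodal style adaptor that operates at the phoneme level, learns pronunciation style from the reference audio, and generates intermediate representations informed by the facial emotion shown in the video;
(2) an utterance-level style learning module that guides both the mel-spectrogram decoding and the refining process from the intermediate embeddings to improve overall style expression;
and (3) a phoneme-guided lip aligner to maintain lip sync.
Extensive experiments on two primary benchmarks, V2C and Grid, show that the proposed method performs favorably against the current state of the art.
The source code and trained models will be released to the public.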

Abstract (original)

Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track. Existing state-of-the-art V2C models break the phonemes in the script according to the divisions between video frames, which solves the temporal alignment problem but leads to incomplete phoneme pronunciation and poor identity stability. To address this problem, we propose StyleDubber, which switches dubbing learning from the frame level to the phoneme level. It contains three main components: (1) a multimodal style adaptor operating at the phoneme level to learn pronunciation style from the reference audio and generate intermediate representations informed by the facial emotion presented in the video; (2) an utterance-level style learning module, which guides both the mel-spectrogram decoding and the refining processes from the intermediate embeddings to improve the overall style expression; and (3) a phoneme-guided lip aligner to maintain lip sync. Extensive experiments on two of the primary benchmarks, V2C and Grid, demonstrate the favorable performance of the proposed method as compared to the current state-of-the-art. The source code and trained models will be released to the public.
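
The abstract contrasts cutting phonemes at video-frame boundaries with working at the phoneme level. As a rough illustration of that idea only, the sketch below keeps every phoneme whole while still filling a fixed total number of output frames. It is not StyleDubber's lip aligner: the function name `allocate_frames`, the per-phoneme `weights`, and the largest-remainder rounding are all assumptions made for this example.

```python
from fractions import Fraction


def allocate_frames(weights, total_frames):
    """Split total_frames among phonemes in proportion to weights.

    Hypothetical helper for illustration. Every phoneme receives at least
    one frame, and the returned counts always sum to total_frames.
    """
    n = len(weights)
    if n == 0:
        raise ValueError("need at least one phoneme")
    if total_frames < n:
        raise ValueError("fewer frames than phonemes")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Give each phoneme one frame up front so none is dropped.
    spare = total_frames - n

    # Exact rational arithmetic avoids float rounding surprises.
    exact = [Fraction(w) for w in weights]
    total_w = sum(exact)
    shares = [w * spare / total_w for w in exact]

    # Floor each share, then hand out the remaining frames to the
    # phonemes with the largest fractional parts (largest remainder).
    counts = [1 + int(s) for s in shares]
    leftover = total_frames - sum(counts)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Three phonemes with relative lengths 1:1:2 spread over 8 frames.
    print(allocate_frames([1, 1, 2], 8))
```

In this toy setting the total length is fixed by the video, while the split inside that length follows the phonemes rather than the frame grid. A real system would predict the weights from text, audio, and lip motion instead of taking them as input.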

arXiv information

Authors	Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton van den Hengel, Ming-Hsuan Yang, Chenggang Yan, Qingming Huang
Published	2024-02-21 14:29:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
