Learning to Dub Movies via Hierarchical Prosody Models

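Summary

Title: Learning to Dub Movies via Hierarchical Prosody Models

Summary:
・Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice clone, V2C) aims to generate speech that matches the speaker's emotion shown in the video, using the desired speaker's voice as the reference.
・V2C is considered more challenging than conventional text-to-speech tasks because the generated speech must also exactly match the varying emotions and speaking speed shown in the video.
・Unlike previous methods, we propose a novel movie dubbing architecture based on hierarchical prosody modelling. The model links visual information to the corresponding speech prosody from three aspects: lip, face, and scene.
・Specifically, lip movement is aligned with speech duration, and facial expression is conveyed to speech energy and pitch through an attention mechanism based on valence and arousal representations, inspired by recent findings in psychology.
・In addition, an emotion booster is designed to capture the atmosphere of the global video scene.
・All of these embeddings are combined to generate a mel-spectrogram, which is then converted to a speech waveform with an existing vocoder.
・Extensive experimental results on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method.
・The source code and trained models will be released to the public.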
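
The facial-expression step in the summary (an attention mechanism that links valence and arousal representations to speech prosody) can be pictured with generic scaled dot-product attention. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 2-dimensional features, and the toy numbers are all hypothetical, and the summary does not say how the actual model projects its features or predicts energy and pitch.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted average of the value rows.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        weights.append(w)
        outputs.append(out)
    return outputs, weights


# Hypothetical toy inputs (assumption): 2 phoneme-level text features act as
# queries; 3 video frames, each with a (valence, arousal) pair, act as keys
# and values.
phoneme_feats = [[1.0, 0.0], [0.0, 1.0]]
frame_va = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

context, attn = cross_attention(phoneme_feats, frame_va, frame_va)
```

Here `context` holds one blended (valence, arousal) vector per phoneme, which a downstream predictor could map to energy or pitch; that mapping is left out because the summary does not describe it.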

Summary (original)

Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as visual voice clone V2C) task aims to generate speeches that match the speaker’s emotion presented in the video using the desired speaker voice as reference. V2C is more challenging than conventional text-to-speech tasks as it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via attention mechanism based on valence and arousal representations inspired by recent psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings together are used to generate mel-spectrogram and then convert to speech waves via existing vocoder. Extensive experimental results on the Chem and V2C benchmark datasets demonstrate the favorable performance of the proposed method. The source code and trained models will be released to the public.

arXiv information

Authors	Gaoxiang Cong, Liang Li, Yuankai Qi, Zhengjun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang
Published	2023-04-04 11:33:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
