SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model

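Abstract

Recent advances in generative models have greatly improved talking-face video generation, but singing video generation remains underexplored.
Existing talking-face models perform poorly when applied to singing because talking and singing differ fundamentally, particularly in audio characteristics and behavioral expression.
We observe that the differences between singing and speech audio show up in frequency and amplitude.
To address this, we design a multi-scale spectral module that lets the model learn singing patterns in the spectral domain.
We also develop a spectral-filtering module that helps the model learn the human behaviors associated with singing audio.
Both modules are integrated into a diffusion model to improve singing video generation, yielding our proposed model, SINGER.
In addition, the lack of high-quality, real-world singing face videos has held back progress in singing video generation.
To fill this gap, we collected an in-the-wild audio-visual singing dataset to support research in this area.
Our experiments show that SINGER generates vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.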
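
The abstract does not describe the internals of the multi-scale spectral module, so the following is only a generic illustration of what analyzing audio "at multiple scales in the spectral domain" can mean, not SINGER's implementation. It computes magnitude spectra of the same signal with several window lengths: short windows track fast changes in time, while long windows resolve frequency more finely. The helper names (`dft_magnitudes`, `multi_scale_spectra`, `dominant_frequency`), the window sizes, the 50% overlap, and the toy 128 Hz tone are all assumptions made for this sketch.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitudes of the one-sided DFT of a real-valued frame (naive O(n^2))."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def multi_scale_spectra(signal, window_sizes=(32, 64, 128)):
    """Hypothetical helper: per-frame magnitude spectra for each window size."""
    spectra = {}
    for size in window_sizes:
        hop = size // 2  # 50% overlap, an arbitrary choice for this sketch
        # Periodic Hann window to reduce spectral leakage at frame edges.
        window = [0.5 - 0.5 * math.cos(2 * math.pi * t / size)
                  for t in range(size)]
        frames = [signal[i:i + size]
                  for i in range(0, len(signal) - size + 1, hop)]
        spectra[size] = [
            dft_magnitudes([x * w for x, w in zip(frame, window)])
            for frame in frames
        ]
    return spectra


def dominant_frequency(mags, window_size, sample_rate):
    """Frequency in Hz of the strongest DFT bin."""
    k = max(range(len(mags)), key=mags.__getitem__)
    return k * sample_rate / window_size


# Toy input (an assumption): 0.5 s of a 128 Hz sine sampled at 1024 Hz.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 128 * t / sample_rate) for t in range(512)]
spectra = multi_scale_spectra(tone)

for size, frames in spectra.items():
    resolution = sample_rate / size  # width of one frequency bin in Hz
    peak = dominant_frequency(frames[0], size, sample_rate)
    print(f"window={size:4d}  frames={len(frames):3d}  "
          f"bin width={resolution:5.1f} Hz  peak={peak:.1f} Hz")
```

In this sketch the 32-sample window yields many frames with coarse 32 Hz bins, while the 128-sample window yields fewer frames with 8 Hz bins. That trade-off between time and frequency resolution is the usual motivation for combining several scales.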

Abstract (original)

Recent advancements in generative models have significantly enhanced talking face video generation, yet singing video generation remains underexplored. The differences between human talking and singing limit the performance of existing talking face video generation models when applied to singing. The fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the effectiveness of existing models. We observe that the differences between singing and talking audios manifest in terms of frequency and amplitude. To address this, we have designed a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model to enhance singing video generation performance, resulting in our proposed model, SINGER. Furthermore, the lack of high-quality real-world singing face videos has hindered the development of the singing video generation community. To address this gap, we have collected an in-the-wild audio-visual singing dataset to facilitate research in this area. Our experiments demonstrate that SINGER is capable of generating vivid singing videos and outperforms state-of-the-art methods in both objective and subjective evaluations.

arXiv information

Authors	Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo
Published	2024-12-04 16:19:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
