RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text

Summary

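In this work, we introduce the challenging task of simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics, advancing beyond existing works that typically address these two modalities in isolation.
To facilitate this, we first collect the RapVerse dataset, a large-scale dataset containing synchronized rap vocals, lyrics, and high-quality 3D holistic body meshes.
Using the RapVerse dataset, we investigate how far scaling autoregressive multimodal transformers across language, audio, and motion can improve the coherent and realistic generation of vocals and whole-body human motions.
For modality unification, a vector-quantized variational autoencoder encodes whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model yields quantized audio tokens that preserve content, prosodic information, and singer identity.
By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions.
Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation.
The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.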
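
The discretization step mentioned in the summary (turning continuous motion features into motion tokens) can be illustrated with a minimal sketch of the quantization lookup at the heart of a vector-quantized autoencoder. This is not the authors' implementation: the function names, the 2-D feature vectors, and the three-entry codebook are all hypothetical, and in a real VQ-VAE both the encoder that produces the features and the codebook itself are learned.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: List[Sequence[float]],
             codebook: List[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: List[int],
               codebook: List[Sequence[float]]) -> List[List[float]]:
    """Recover the (lossy) feature vectors that the tokens stand for."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical codebook and per-frame features (stand-ins for encoder outputs).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.6, 0.1)]

motion_tokens = quantize(frames, codebook)
print(motion_tokens)  # [0, 1, 2, 1]
```

The resulting integer sequence is the kind of discrete representation that an autoregressive transformer can model alongside text tokens and quantized audio tokens; how the three token streams are actually combined is not specified in the summary and is left out of this sketch.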

Summary (original)

In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information, and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation. The project page is available for research purposes at https://vis-www.cs.umass.edu/RapVerse.

arXiv information

Authors	Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan
Published	2024-05-30 17:59:39+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
