Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis

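Abstract

Humans in face-to-face conversation communicate verbally and non-verbally at the same time, yet joint, unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field.
These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but they are currently held back by a lack of suitably large datasets, because existing methods are trained on parallel data from all constituent modalities.
Inspired by student-teacher methods, we propose a straightforward solution to the data shortage: synthesising additional training material.
Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material.
In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field.
Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, and that the proposed architecture yields further benefits when pre-trained on the synthetic data.
For example output, see https://shivammehta25.github.io/MAGI/.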

Abstract (original)

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data. See https://shivammehta25.github.io/MAGI/ for example output.
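
The data-creation step described in the abstract can be pictured with a minimal sketch. Everything below is an illustrative assumption rather than the authors' implementation: the names `ParallelExample`, `make_synthetic_corpus`, `toy_speech_teacher`, and `toy_gesture_teacher` are hypothetical, the toy "teachers" are trivial stand-ins for real unimodal synthesis models, and chaining the gesture teacher on the synthesised audio is one plausible way to keep the two modalities aligned, not something the abstract specifies.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Audio = List[float]         # toy waveform samples
Motion = List[List[float]]  # toy pose frames, one list of values per frame


@dataclass
class ParallelExample:
    """One text/audio/motion triple of synthetic parallel training data."""
    text: str
    audio: Audio
    motion: Motion


def make_synthetic_corpus(
    texts: Sequence[str],
    speech_teacher: Callable[[str], Audio],
    gesture_teacher: Callable[[Audio], Motion],
) -> List[ParallelExample]:
    """Run each text through the unimodal teachers to build parallel data.

    Assumption: the gesture teacher is conditioned on the synthesised audio,
    so the motion is time-aligned with the speech.
    """
    corpus = []
    for text in texts:
        audio = speech_teacher(text)
        motion = gesture_teacher(audio)
        corpus.append(ParallelExample(text, audio, motion))
    return corpus


def toy_speech_teacher(text: str) -> Audio:
    # Stand-in for a text-to-speech model: 4 fake samples per character.
    return [(ord(ch) % 32) / 32.0 for ch in text for _ in range(4)]


def toy_gesture_teacher(audio: Audio) -> Motion:
    # Stand-in for a speech-to-gesture model: one 3-value frame per 4 samples.
    frames = []
    for start in range(0, len(audio), 4):
        level = sum(audio[start:start + 4]) / 4.0
        frames.append([level, -level, 0.0])
    return frames


corpus = make_synthetic_corpus(
    ["hello there", "nice to meet you"],
    toy_speech_teacher,
    toy_gesture_teacher,
)
print(len(corpus), len(corpus[0].audio), len(corpus[0].motion))
```

In a real setting the resulting corpus would be used to pre-train the joint speech-and-gesture model; the sketch stops at data creation because the abstract gives no further training details.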

arXiv information

Authors	Shivam Mehta, Anna Deichler, Jim O’Regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson
Published	2024-04-30 15:22:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
