Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

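Summary

Large-scale generative models such as GPT and DALL-E have revolutionized the research community.
These models not only generate high-fidelity outputs but are also generalists that can solve tasks they were not explicitly taught.
In contrast, speech generative models are still primitive in terms of scale and task generalization.
In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale.
Voicebox is a non-autoregressive flow-matching model trained to infill speech given audio context and text, using over 50,000 hours of speech that was neither filtered nor enhanced.
Like GPT, Voicebox can perform many different tasks through in-context learning, but it is more flexible because it can also condition on future context.
Voicebox can be used for monolingual or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation.
In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (word error rate of 1.9% versus 5.9% for VALL-E) and audio similarity (0.681 versus 0.580 for VALL-E), while being up to 20 times faster.
Audio samples are available at https://voicebox.metademolab.com.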

Summary (original)

Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative model for speech at scale. Voicebox is a non-autoregressive flow-matching model trained to infill speech, given audio context and text, trained on over 50K hours of speech that are not filtered or enhanced. Similar to GPT, Voicebox can perform many different tasks through in-context learning, but is more flexible as it can also condition on future context. Voicebox can be used for mono or cross-lingual zero-shot text-to-speech synthesis, noise removal, content editing, style conversion, and diverse sample generation. In particular, Voicebox outperforms the state-of-the-art zero-shot TTS model VALL-E on both intelligibility (5.9% vs 1.9% word error rates) and audio similarity (0.580 vs 0.681) while being up to 20 times faster. Audio samples can be found in \url{https://voicebox.metademolab.com}.

arXiv information

Authors	Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu
Published	2023-10-19 13:23:28+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
