CPSP: Learning Speech Concepts From Phoneme Supervision

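Abstract

For fine-grained generation and recognition tasks such as minimally supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representation extracted from speech needs to carry information that lies between text coding and acoustic coding. Linguistic content should be salient, while paralinguistic information such as speaker identity and acoustic detail should be removed. Existing methods for extracting fine-grained intermediate representations from speech, however, suffer from excessive redundancy and dimension explosion. In addition, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification, which makes them ill-suited to TTS, VC, and ASR. To address these problems, we propose Contrastive Phoneme-Speech Pretraining (CPSP), which uses three encoders, one decoder, and contrastive learning to bring phonemes and speech into a joint multimodal space and to learn how to connect the two at the frame level. The CPSP model is trained on 210k pairs of speech and phoneme text and achieves minimally supervised TTS, VC, and ASR. The proposed method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. A website with audio samples is provided.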

Abstract (original)

For fine-grained generation and recognition tasks such as minimally-supervised text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), the intermediate representation extracted from speech should contain information that is between text coding and acoustic coding. The linguistic content is salient, while the paralinguistic information such as speaker identity and acoustic details should be removed. However, existing methods for extracting fine-grained intermediate representations from speech suffer from issues of excessive redundancy and dimension explosion. Additionally, existing contrastive learning methods in the audio field focus on extracting global descriptive information for downstream audio classification tasks, making them unsuitable for TTS, VC, and ASR tasks. To address these issues, we propose a method named Contrastive Phoneme-Speech Pretraining (CPSP), which uses three encoders, one decoder, and contrastive learning to bring phoneme and speech into a joint multimodal space, learning how to connect phoneme and speech at the frame level. The CPSP model is trained on 210k speech and phoneme text pairs, achieving minimally-supervised TTS, VC, and ASR. The proposed CPSP method offers a promising solution for fine-grained generation and recognition downstream tasks in speech processing. We provide a website with audio samples.
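The abstract says contrastive learning connects phonemes and speech at the frame level, but it does not give the loss. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE objective over time-aligned frame pairs. The function names, the temperature value, and the assumption that both sequences already have the same length and embedding size are hypothetical, and the three encoders and the decoder are omitted entirely.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def frame_contrastive_loss(phoneme_frames, speech_frames, temperature=0.07):
    """Symmetric InfoNCE over frame pairs (an assumed, CLIP-style objective).

    Frame i of the phoneme sequence is treated as the positive for frame i
    of the speech sequence; every other frame acts as a negative.
    """
    if not phoneme_frames or len(phoneme_frames) != len(speech_frames):
        raise ValueError("expected two non-empty sequences of equal length")

    p = [l2_normalize(v) for v in phoneme_frames]
    s = [l2_normalize(v) for v in speech_frames]
    n = len(p)

    # Cosine similarity between every phoneme frame and every speech frame.
    logits = [
        [sum(a * b for a, b in zip(p[i], s[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the phoneme-to-speech and speech-to-phoneme directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(columns))


# Toy check: matched frames should score a lower loss than shifted frames.
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
loss_aligned = frame_contrastive_loss(frames, frames)
loss_shifted = frame_contrastive_loss(frames, frames[1:] + frames[:1])
print(loss_aligned < loss_shifted)
```

In this toy setup the loss falls as each phoneme frame moves closer to its own speech frame and away from the others, which is the frame-level pairing the abstract describes, in contrast to the clip-level pairing used for audio classification.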

arXiv information

Authors	Chunyu Qiang, Hao Li, Yixin Tian, Ruibo Fu, Tao Wang, Longbiao Wang, Jianwu Dang
Published	2023-09-01 12:35:43+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
