Scaling Rich Style-Prompted Text-to-Speech Datasets

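Summary

We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions.
Rich abstract tags (e.g. guttural, nasal, pained) have so far been explored only in small-scale human-annotated datasets, while existing large-scale datasets cover only basic tags (e.g. low-pitched, slow, loud).
By combining off-the-shelf text and speech embedders, classifiers, and an audio language model, we scale rich tag annotation automatically for the first time.
ParaSpeechCaps covers 59 style tags in total, including both speaker-level intrinsic tags and utterance-level situational tags.
It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled).
We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps; compared with the best-performing baseline, which combines existing rich style tag datasets, it improves style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS).
We ablate several of our dataset design choices to lay the foundation for future work in this area.
The dataset, models, and code are released at https://github.com/ajd12342/paraspeechcaps .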

Abstract (original)

We introduce Paralinguistic Speech Captions (ParaSpeechCaps), a large-scale dataset that annotates speech utterances with rich style captions. While rich abstract tags (e.g. guttural, nasal, pained) have been explored in small-scale human-annotated datasets, existing large-scale datasets only cover basic tags (e.g. low-pitched, slow, loud). We combine off-the-shelf text and speech embedders, classifiers and an audio language model to automatically scale rich tag annotations for the first time. ParaSpeechCaps covers a total of 59 style tags, including both speaker-level intrinsic tags and utterance-level situational tags. It consists of 342 hours of human-labelled data (PSC-Base) and 2427 hours of automatically annotated data (PSC-Scaled). We finetune Parler-TTS, an open-source style-prompted TTS model, on ParaSpeechCaps, and achieve improved style consistency (+7.9% Consistency MOS) and speech quality (+15.5% Naturalness MOS) over the best performing baseline that combines existing rich style tag datasets. We ablate several of our dataset design choices to lay the foundation for future work in this space. Our dataset, models and code are released at https://github.com/ajd12342/paraspeechcaps .
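The split between human-labelled and automatically annotated audio can be made concrete with a little arithmetic. The sketch below uses only the two hour counts stated in the abstract; the function name `composition` and the dictionary keys are illustrative choices, not names from the ParaSpeechCaps release.

```python
# Hour counts as reported in the abstract.
PSC_BASE_HOURS = 342     # human-labelled data (PSC-Base)
PSC_SCALED_HOURS = 2427  # automatically annotated data (PSC-Scaled)


def composition(base_hours: float, scaled_hours: float) -> dict:
    """Return total hours and each subset's share of the total.

    Hypothetical helper for illustration only; it is not part of the
    released ParaSpeechCaps code.
    """
    total = base_hours + scaled_hours
    return {
        "total_hours": total,
        "base_share": base_hours / total,
        "scaled_share": scaled_hours / total,
    }


summary = composition(PSC_BASE_HOURS, PSC_SCALED_HOURS)
print(f"total hours: {summary['total_hours']}")
print(f"PSC-Base share: {summary['base_share']:.1%}")
print(f"PSC-Scaled share: {summary['scaled_share']:.1%}")
```

Under these figures, the automatically annotated portion makes up the large majority of the dataset, which is why the quality of the automatic tagging pipeline matters for the reported results.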

arXiv information

Authors	Anuj Diwan, Zhisheng Zheng, David Harwath, Eunsol Choi
Published	2025-03-06 18:57:40+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
