Towards Controllable Speech Synthesis in the Era of Large Language Models: A Survey

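Summary

Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural human speech from text.
Recently, with growing industrial demand, TTS technology has evolved beyond synthesizing human-like speech to enabling controllable speech generation.
This includes fine-grained control over various attributes of the synthesized speech, such as emotion, prosody, timbre, and duration.
In addition, advances in deep learning, such as diffusion models and large language models, have significantly enhanced controllable TTS over the past several years.
This paper comprehensively surveys controllable TTS, from basic control techniques to methods that use natural language prompts, with the aim of clarifying the current state of research.
We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, and provide a comprehensive and clear taxonomy of existing methods.
We also provide a detailed overview of datasets and evaluation metrics and outline the applications and future directions of controllable TTS.
To the best of our knowledge, this survey is the first comprehensive review of emerging controllable TTS methods and can serve as a useful resource for both academic researchers and industry practitioners.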

Abstract (original)

Text-to-speech (TTS), also known as speech synthesis, is a prominent research area that aims to generate natural-sounding human speech from text. Recently, with the increasing industrial demand, TTS technologies have evolved beyond synthesizing human-like speech to enabling controllable speech generation. This includes fine-grained control over various attributes of synthesized speech such as emotion, prosody, timbre, and duration. Besides, advancements in deep learning, such as diffusion and large language models, have significantly enhanced controllable TTS over the past several years. In this paper, we conduct a comprehensive survey of controllable TTS, covering approaches ranging from basic control techniques to methods utilizing natural language prompts, aiming to provide a clear understanding of the current state of research. We examine the general controllable TTS pipeline, challenges, model architectures, and control strategies, offering a comprehensive and clear taxonomy of existing methods. Additionally, we provide a detailed summary of datasets and evaluation metrics and shed some light on the applications and future directions of controllable TTS. To the best of our knowledge, this survey paper provides the first comprehensive review of emerging controllable TTS methods, which can serve as a beneficial resource for both academic researchers and industry practitioners.

arXiv information

Authors	Tianxin Xie, Yan Rong, Pengfei Zhang, Li Liu
Published	2024-12-09 15:50:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
