EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

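Summary

Human speech goes beyond the mere transfer of information.
It is a profound exchange of emotions and a connection between individuals.
Text-to-speech (TTS) models have made major progress, but they still face challenges in controlling emotional expression in generated speech.
In this work, we propose EmoVoice, a novel emotion-controllable TTS model that uses large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control, together with a phoneme boost variant design in which the model outputs phoneme tokens and audio tokens in parallel to improve content consistency.
We also introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural-language descriptions.
EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using in-house data.
In addition, we investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using the SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech.
Demo samples are available at https://anonymous.4open.science/r/EmoVoice-DF55.
The dataset, code, and checkpoints will be released.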

Summary (original)

Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression in the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained freestyle natural language emotion control, and a phoneme boost variant design that makes the model output phoneme tokens and audio tokens in parallel to enhance content consistency, inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques. Besides, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using SOTA multimodal LLMs GPT-4o-audio and Gemini to assess emotional speech. Demo samples are available at https://anonymous.4open.science/r/EmoVoice-DF55. Dataset, code, and checkpoints will be released.

arXiv information

Authors	Guanrou Yang, Chen Yang, Qian Chen, Ziyang Ma, Wenxi Chen, Wen Wang, Tianrui Wang, Yifan Yang, Zhikang Niu, Wenrui Liu, Fan Yu, Zhihao Du, Zhifu Gao, ShiLiang Zhang, Xie Chen
Published	2025-04-18 08:18:11+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
