Controllable Speaking Styles Using a Large Language Model

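Summary

Reference-based Text-to-Speech (TTS) models can generate multiple prosodically different renditions of the same target text.
Such models jointly learn a latent acoustic space during training, which can then be sampled at inference time.
Controlling these models at inference time usually requires finding a suitable reference utterance, which is not trivial.
Large generative language models (LLMs) have shown excellent performance on a wide range of language-related tasks.
Given only a natural-language query text (the prompt), such models can be used to solve specific, context-dependent tasks.
Recent work in TTS has attempted similar prompt-based control for generating novel speaking styles.
These methods need no reference utterance and, under ideal conditions, can be controlled with a prompt alone.
However, existing methods typically require a prompt-labelled speech corpus to jointly train a prompt-conditioned encoder.
In contrast, this work uses an LLM to directly suggest prosodic modifications for a controllable TTS model, based on contextual information provided in the prompt.
The prompt can be designed for many different tasks.
Two demonstrations are given: control of speaking style, and prosody appropriate for a given dialogue context.
The proposed method is rated most appropriate in 50% of cases, compared with 31% for a baseline model.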

Summary (original)

Reference-based Text-to-Speech (TTS) models can generate multiple, prosodically-different renditions of the same target text. Such models jointly learn a latent acoustic space during training, which can be sampled from during inference. Controlling these models during inference typically requires finding an appropriate reference utterance, which is non-trivial. Large generative language models (LLMs) have shown excellent performance in various language-related tasks. Given only a natural language query text (the prompt), such models can be used to solve specific, context-dependent tasks. Recent work in TTS has attempted similar prompt-based control of novel speaking style generation. Those methods do not require a reference utterance and can, under ideal conditions, be controlled with only a prompt. But existing methods typically require a prompt-labelled speech corpus for jointly training a prompt-conditioned encoder. In contrast, we instead employ an LLM to directly suggest prosodic modifications for a controllable TTS model, using contextual information provided in the prompt. The prompt can be designed for a multitude of tasks. Here, we give two demonstrations: control of speaking style; prosody appropriate for a given dialogue context. The proposed method is rated most appropriate in 50% of cases vs. 31% for a baseline model.
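
The idea of letting an LLM suggest prosodic modifications from a prompt can be illustrated with a minimal sketch. It is not the authors' implementation: the control dimensions (`pitch`, `energy`, `duration`), the value range of -1 to 1, the JSON reply format, and the names `WordProsody`, `build_prompt` and `parse_suggestions` are all assumptions made for illustration, since the abstract does not specify them. No LLM or TTS system is called here; the sketch only shows how a prompt might be composed and how a reply might be turned into per-word adjustments.

```python
import json
from dataclasses import dataclass

# Hypothetical control dimensions; the abstract does not name the ones used.
FEATURES = ("pitch", "energy", "duration")


@dataclass
class WordProsody:
    """Suggested adjustment for one word, each value clamped to [-1, 1]."""
    word: str
    pitch: float = 0.0
    energy: float = 0.0
    duration: float = 0.0


def build_prompt(text: str, context: str) -> str:
    """Compose a natural-language query asking for per-word prosodic changes."""
    return (
        f"Context: {context}\n"
        f"Target text: {text}\n"
        "For each word, suggest changes to pitch, energy and duration "
        "as numbers between -1 and 1. Answer with a JSON list of objects "
        'with the keys "word", "pitch", "energy" and "duration".'
    )


def parse_suggestions(reply: str, text: str) -> list[WordProsody]:
    """Turn an LLM reply into one WordProsody per word of `text`.

    Falls back to neutral (all-zero) prosody when the reply is not valid
    JSON or does not contain exactly one entry per word.
    """
    words = text.split()
    neutral = [WordProsody(w) for w in words]
    try:
        items = json.loads(reply)
    except json.JSONDecodeError:
        return neutral
    if not isinstance(items, list) or len(items) != len(words):
        return neutral

    result = []
    for word, item in zip(words, items):
        if not isinstance(item, dict):
            result.append(WordProsody(word))
            continue
        values = {}
        for name in FEATURES:
            raw = item.get(name, 0.0)
            value = float(raw) if isinstance(raw, (int, float)) else 0.0
            # Clamp so an over-eager suggestion cannot leave the assumed range.
            values[name] = max(-1.0, min(1.0, value))
        result.append(WordProsody(word, **values))
    return result
```

In such a setup, the string returned by `build_prompt` would be sent to an LLM of choice, and the list returned by `parse_suggestions` would be passed to whatever word-level controls the TTS model exposes. Falling back to neutral prosody on a malformed reply keeps synthesis usable even when the LLM output cannot be parsed.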

arXiv information

Authors	Atli Thor Sigurgeirsson, Simon King
Published	2023-09-19 16:35:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
