Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech

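Summary

Autoregressive (AR) Transformer-based sequence models are known to generalize poorly to sequences longer than those seen during training.
When these models are applied to text-to-speech (TTS), they tend to drop or repeat words or produce erratic output, particularly on long utterances.
This paper introduces enhancements for AR Transformer-based encoder-decoder TTS systems that address these robustness and length-generalization problems.
The approach uses an alignment mechanism to supply the cross-attention operations with relative location information.
The associated alignment position is learned as a latent property of the model via backpropagation, so no external alignment information is needed during training.
Although the approach is tailored to the monotonic nature of TTS input-output alignment, it still benefits from the flexible modeling power of interleaved multi-head self- and cross-attention operations.
A system incorporating these improvements, called Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system while eliminating repeated and dropped words and generalizing to any practical utterance length.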
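
The idea of giving cross-attention relative location information around a monotonically advancing alignment position can be illustrated with a small sketch. This is not the paper's implementation: the linear distance penalty, the softplus step, and the names `location_biased_attention`, `advance`, `slope`, and `raw_delta` are all assumptions chosen for illustration, and in a real model the bias and the step would be learned.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def location_biased_attention(scores, align_pos, slope=1.0):
    """Attention weights over encoder positions for one decoder step.

    Each content score is penalized by its distance from the current
    alignment position (a hypothetical linear penalty, assumed here),
    so attention depends on relative rather than absolute location.
    """
    biased = [s - slope * abs(j - align_pos) for j, s in enumerate(scores)]
    return softmax(biased)


def advance(align_pos, raw_delta, num_inputs):
    """Monotonic alignment update (assumed form).

    softplus keeps the step non-negative, so the position never moves
    backwards; it is clamped to the last encoder index.
    """
    step = math.log1p(math.exp(raw_delta))
    return min(align_pos + step, num_inputs - 1)


# With uniform content scores, attention peaks at the alignment position.
weights = location_biased_attention([0.0] * 5, align_pos=2.0)
peak = max(range(len(weights)), key=weights.__getitem__)
next_pos = advance(2.0, raw_delta=0.0, num_inputs=5)
```

Because the bias depends only on the offset from the alignment position, the same attention pattern applies at any point in the input, which is one plausible reading of why relative location information helps with utterances longer than those seen in training.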

Abstract (original)

Autoregressive (AR) Transformer-based sequence models are known to have difficulty generalizing to sequences longer than those seen during training. When applied to text-to-speech (TTS), these models tend to drop or repeat words or produce erratic output, especially for longer utterances. In this paper, we introduce enhancements aimed at AR Transformer-based encoder-decoder TTS systems that address these robustness and length generalization issues. Our approach uses an alignment mechanism to provide cross-attention operations with relative location information. The associated alignment position is learned as a latent property of the model via backprop and requires no external alignment information during training. While the approach is tailored to the monotonic nature of TTS input-output alignment, it is still able to benefit from the flexible modeling power of interleaved multi-head self- and cross-attention operations. A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system, while eliminating problems with repeated or dropped words and enabling generalization to any practical utterance length.

arXiv information

Authors	Eric Battenberg, RJ Skerry-Ryan, Daisy Stanton, Soroosh Mariooryad, Matt Shannon, Julian Salazar, David Kao
Published	2024-10-29 16:17:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
