VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

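Abstract

We propose VoxtLM, a decoder-only language model that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation.
VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features, and uses special tokens to enable multitask learning.
Compared with a single-task model, VoxtLM shows a significant improvement in speech synthesis: the speech intelligibility score improves from 28.9 to 5.6 (lower is better) and objective quality rises from 2.68 to 3.90.
VoxtLM also improves speech generation and speech recognition performance over its single-task counterparts.
VoxtLM is trained on publicly available data, and the training recipes and model checkpoints will be open-sourced to make the work fully reproducible.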
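
The idea of one shared vocabulary plus task-marking special tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the special-token names, the `<spN>` naming of speech units, and the helper functions `build_vocab`, `format_example`, and `encode` are all assumptions made for illustration only.

```python
from typing import Dict, List

# Hypothetical special tokens; the names are assumptions, not taken from the source.
SPECIAL_TOKENS = [
    "<start-text>",
    "<start-speech>",
    "<generate-text>",
    "<generate-speech>",
    "<eos>",
]


def build_vocab(text_tokens: List[str], num_speech_units: int) -> Dict[str, int]:
    """Merge special, text, and discrete speech tokens into one id space."""
    speech_tokens = [f"<sp{i}>" for i in range(num_speech_units)]
    vocab: Dict[str, int] = {}
    for token in SPECIAL_TOKENS + text_tokens + speech_tokens:
        if token in vocab:
            raise ValueError(f"duplicate token: {token}")
        vocab[token] = len(vocab)
    return vocab


def format_example(task: str, text: List[str], speech_units: List[int]) -> List[str]:
    """Lay out one training sequence; the special tokens tell the model the task."""
    speech = [f"<sp{u}>" for u in speech_units]
    if task == "asr":  # speech recognition: speech in, text out
        return ["<start-speech>", *speech, "<generate-text>", *text, "<eos>"]
    if task == "tts":  # speech synthesis: text in, speech out
        return ["<start-text>", *text, "<generate-speech>", *speech, "<eos>"]
    if task == "textlm":  # text generation: text only
        return ["<generate-text>", *text, "<eos>"]
    if task == "speechlm":  # speech continuation: speech only
        return ["<generate-speech>", *speech, "<eos>"]
    raise ValueError(f"unknown task: {task}")


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map a token sequence to integer ids for a decoder-only language model."""
    return [vocab[t] for t in tokens]


if __name__ == "__main__":
    vocab = build_vocab(["hello", "world"], num_speech_units=4)
    sequence = format_example("asr", ["hello", "world"], [2, 0, 3])
    print(sequence)
    print(encode(sequence, vocab))
```

Because all four tasks reduce to next-token prediction over the same id space, a single decoder can be trained on a mixture of such sequences; only the special tokens differ between tasks.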

Abstract (original)

We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates text vocabulary with discrete speech tokens from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with improvements in both speech intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90. VoxtLM also improves speech generation and speech recognition performance over the single-task counterpart. VoxtLM is trained with publicly available data and training recipes and model checkpoints will be open-sourced to make fully reproducible work.

arXiv information

Authors	Soumi Maiti, Yifan Peng, Shukjae Choi, Jee-weon Jung, Xuankai Chang, Shinji Watanabe
Published	2023-09-18 14:13:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
