Textless Low-Resource Speech-to-Speech Translation With Unit Language Models

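Summary

Existing speech-to-speech translation models fall into two groups: textless models trained on hundreds of hours of parallel speech data, and unsupervised models that use text as an intermediate step.
Both approaches exclude languages that are primarily spoken and language pairs that lack large-scale parallel speech data, which limits how widely speech-to-speech translation models can be built.
The authors present a new framework for training textless, low-resource speech-to-speech translation (S2ST) systems that need only dozens of hours of parallel speech data.
They reformulate S2ST as a unit-to-unit seq2seq translation task and begin by pretraining a model on large-scale monolingual speech data.
They then finetune it on a small amount of parallel speech data ($20$ to $60$ hours).
Finally, they improve model performance with an unsupervised backtranslation objective.
Models for English-to-German, German-to-English, and Marathi-to-English translation are trained and evaluated on three domains (European Parliament, Common Voice, and All India Radio) using single-speaker synthesized speech data.
Measured with the ASR-BLEU metric, the models achieve reasonable performance on all three domains, and some come within 1 to 2 points of the supervised topline.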

Abstract (original)

Existing speech-to-speech translation models fall into two camps: textless models trained with hundreds of hours of parallel speech data or unsupervised models that leverage text as an intermediate step. Both approaches limit building speech-to-speech translation models for a wide range of languages, as they exclude languages that are primarily spoken and language pairs that lack large-scale parallel speech data. We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems that only need dozens of hours of parallel speech data. We reformulate S2ST as a unit-to-unit seq2seq translation task, and start by pretraining a model on large-scale monolingual speech data. Then, we finetune it with a small amount of parallel speech data ($20-60$ hours). Lastly, we improve model performance through an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech data. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our supervised topline.
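
The abstract treats S2ST as unit-to-unit translation but does not say how speech becomes units. As a rough illustration only, the sketch below assumes that each frame-level feature vector is assigned to its nearest centroid in a fixed codebook and that consecutive repeated units are collapsed. Both steps are assumptions made for this example, and the names `nearest_centroid` and `frames_to_units`, the toy centroids, and the toy frames are hypothetical, not taken from the paper.

```python
from typing import List, Sequence


def nearest_centroid(frame: Sequence[float],
                     centroids: Sequence[Sequence[float]]) -> int:
    """Return the index of the centroid closest to `frame`
    (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((f - c) ** 2 for f, c in zip(frame, centroids[k])),
    )


def frames_to_units(frames: Sequence[Sequence[float]],
                    centroids: Sequence[Sequence[float]],
                    collapse_repeats: bool = True) -> List[int]:
    """Map frame-level features to discrete unit IDs.

    Assumption for this sketch: units are nearest-centroid indices, and
    runs of identical units are collapsed to a single unit.
    """
    units = [nearest_centroid(frame, centroids) for frame in frames]
    if not collapse_repeats:
        return units
    deduped: List[int] = []
    for unit in units:
        # Keep a unit only when it differs from the previous one.
        if not deduped or deduped[-1] != unit:
            deduped.append(unit)
    return deduped


if __name__ == "__main__":
    # Toy 2-D codebook and features; real features would be much larger.
    toy_centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    toy_frames = [[0.1, 0.0], [0.0, 0.1], [0.9, 1.1],
                  [1.0, 0.9], [0.1, 0.9], [0.0, 0.0]]
    print(frames_to_units(toy_frames, toy_centroids))
```

Under this view, a source utterance and its translation are both plain integer sequences, so a standard seq2seq model can be pretrained, finetuned, and backtranslated on them without any text.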

arXiv information

Authors	Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
Published	2024-02-20 18:55:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
