From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

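Summary

Recent progress in automatic speech recognition (ASR) has been driven largely by massive speech corpora.
However, extending coverage to diverse languages with limited resources remains a formidable challenge.
This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech with off-the-shelf text-to-speech (TTS) models.
The authors show that just tens of hours of real transcribed speech are enough to train TTS models that generate synthetic speech at hundreds of times the original volume while maintaining high quality.
To evaluate synthetic speech quality, they develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training.
Using Speech Back-Translation, they generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%.
These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.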

Abstract (original)

Recent advances in Automatic Speech Recognition (ASR) have been largely fueled by massive speech corpora. However, extending coverage to diverse languages with limited resources remains a formidable challenge. This paper introduces Speech Back-Translation, a scalable pipeline that improves multilingual ASR models by converting large-scale text corpora into synthetic speech via off-the-shelf text-to-speech (TTS) models. We demonstrate that just tens of hours of real transcribed speech can effectively train TTS models to generate synthetic speech at hundreds of times the original volume while maintaining high quality. To evaluate synthetic speech quality, we develop an intelligibility-based assessment framework and establish clear thresholds for when synthetic data benefits ASR training. Using Speech Back-Translation, we generate more than 500,000 hours of synthetic speech in ten languages and continue pre-training Whisper-large-v3, achieving average transcription error reductions of over 30%. These results highlight the scalability and effectiveness of Speech Back-Translation for enhancing multilingual ASR systems.
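
The abstract does not say how intelligibility is measured or where the thresholds sit. As a rough sketch only, one common proxy is the word error rate (WER) between the text given to the TTS model and an ASR transcript of the resulting audio, with clips above a cutoff discarded. The function names, the lowercase whitespace tokenization, and the 0.2 cutoff below are all illustrative assumptions, not details from the paper.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: empty reference scores 0 only
        # when the hypothesis is empty too.
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


def filter_by_intelligibility(
    pairs: List[Tuple[str, str]], max_wer: float = 0.2  # hypothetical cutoff
) -> List[Tuple[str, str]]:
    """Keep (source_text, asr_transcript) pairs whose WER is at most max_wer."""
    return [(text, hyp) for text, hyp in pairs if word_error_rate(text, hyp) <= max_wer]
```

In practice the transcript would come from running an ASR model over each synthetic clip; here it is passed in as a plain string so the sketch stays self-contained.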

arXiv information

Authors	Tianduo Wang, Lu Xu, Wei Lu, Shanbo Cheng
Published	2025-05-22 17:51:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
