Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages

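Summary

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across more than 100 languages.
This is achieved by pre-training the model's encoder on a large unlabeled multilingual dataset of 12 million (M) hours covering over 300 languages, and then fine-tuning on a smaller labeled dataset.
Using multilingual pre-training with random-projection quantization and speech-text modality matching, the model achieves state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
We also demonstrate that, despite using a labeled training set one-seventh the size of the one used for the Whisper model, the model shows comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.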

Summary (original)

We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages. This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages, and fine-tuning on a smaller labeled dataset. We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks. We also demonstrate that despite using a labeled training set 1/7-th the size of that used for the Whisper model, our model exhibits comparable or better performance on both in-domain and out-of-domain speech recognition tasks across many languages.
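
The abstract names random-projection quantization without spelling it out. As a rough illustration only, and assuming the common formulation in which a frozen random matrix projects each speech feature frame and a frozen random codebook supplies the discrete targets for pre-training, the labeling step might look like the sketch below. The dimensions, the function names (`make_quantizer`, `quantize`), and the Gaussian initialization are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def make_quantizer(input_dim, code_dim, codebook_size, seed=0):
    """Build a frozen random projection and a frozen random codebook.

    Neither is trained; both are fixed once at initialization.
    Gaussian initialization is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    # Projection matrix: code_dim rows, input_dim columns.
    projection = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                  for _ in range(code_dim)]
    # Codebook: codebook_size unit-length vectors of size code_dim.
    codebook = [_normalize([rng.gauss(0.0, 1.0) for _ in range(code_dim)])
                for _ in range(codebook_size)]
    return projection, codebook


def quantize(frame, projection, codebook):
    """Map one feature frame to the index of its nearest codebook entry."""
    projected = _normalize([sum(w * x for w, x in zip(row, frame))
                            for row in projection])
    # For unit vectors, the nearest neighbour has the largest dot product.
    scores = [sum(p * c for p, c in zip(projected, code)) for code in codebook]
    return max(range(len(codebook)), key=scores.__getitem__)


# Hypothetical toy input: 5 frames of 8-dimensional features.
projection, codebook = make_quantizer(input_dim=8, code_dim=4, codebook_size=16)
frames = [[random.Random(i).uniform(-1.0, 1.0) for _ in range(8)]
          for i in range(5)]
labels = [quantize(frame, projection, codebook) for frame in frames]
```

In a setup like this, the integer `labels` would serve as prediction targets for the encoder on masked frames, so no labeled transcripts are needed at the pre-training stage.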

arXiv information

Authors	Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar, Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise Beaufays, Yonghui Wu
Published	2023-09-25 01:20:23+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
