Fast Word Error Rate Estimation Using Self-Supervised Representations For Speech And Text

Summary

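The quality of automatic speech recognition (ASR) is typically measured by word error rate (WER).
WER estimation is the task of predicting the WER of an ASR system given a speech utterance and its transcription.
The task has drawn growing attention as advanced ASR systems are trained on ever larger amounts of data.
In that setting, WER estimation is needed in many scenarios, for example when selecting training data whose transcription quality is unknown, or when estimating the test performance of an ASR system without ground-truth transcriptions.
With large amounts of data, the computational efficiency of a WER estimator becomes essential in practical applications.
However, prior work has usually not treated efficiency as a priority.
This paper introduces a fast WER estimator (Fe-WER) that uses self-supervised learning representations (SSLR).
The estimator is built on SSLR aggregated by average pooling.
On Ted-Lium3, Fe-WER outperformed the e-WER3 baseline by 19.69% relative in root mean square error and by 7.16% relative in Pearson correlation coefficient.
In addition, the duration-weighted WER estimate was 10.43% against a target of 10.88%.
Finally, inference was about 4 times faster in terms of real-time factor.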
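
The two aggregation steps named in the summary, average pooling of representations and duration weighting of per-utterance estimates, can be sketched as follows. This is only an illustration under stated assumptions: the function names `mean_pool` and `duration_weighted_wer`, the vector sizes, and the toy numbers are hypothetical and are not taken from the paper, which does not give implementation details in its abstract.

```python
from typing import List, Sequence


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average a variable-length sequence of vectors into one fixed-size vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def duration_weighted_wer(wers: Sequence[float], durations: Sequence[float]) -> float:
    """Combine per-utterance WER estimates into one figure, weighted by duration."""
    total = sum(durations)
    if total <= 0:
        raise ValueError("total duration must be positive")
    return sum(w * d for w, d in zip(wers, durations)) / total


# Toy data, invented for illustration (not from the paper).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three 2-dimensional frame vectors
pooled = mean_pool(frames)                      # one 2-dimensional vector

# Two utterances: estimated WER 10% over 30 s and 20% over 10 s.
corpus_wer = duration_weighted_wer([0.10, 0.20], [30.0, 10.0])
```

Pooling removes the dependence on utterance length, so a small feed-forward predictor can consume the result directly. Duration weighting gives longer utterances proportionally more influence on the corpus-level figure.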

Summary (original)

The quality of automatic speech recognition (ASR) is typically measured by word error rate (WER). WER estimation is a task aiming to predict the WER of an ASR system, given a speech utterance and a transcription. This task has gained increasing attention while advanced ASR systems are trained on large amounts of data. In this case, WER estimation becomes necessary in many scenarios, for example, selecting training data with unknown transcription quality or estimating the testing performance of an ASR system without ground truth transcriptions. Facing large amounts of data, the computation efficiency of a WER estimator becomes essential in practical applications. However, previous works usually did not consider it as a priority. In this paper, a Fast WER estimator (Fe-WER) using self-supervised learning representation (SSLR) is introduced. The estimator is built upon SSLR aggregated by average pooling. The results show that Fe-WER outperformed the e-WER3 baseline relatively by 19.69% and 7.16% on Ted-Lium3 in both evaluation metrics of root mean square error and Pearson correlation coefficient, respectively. Moreover, the estimation weighted by duration was 10.43% when the target was 10.88%. Lastly, the inference speed was about 4x in terms of a real-time factor.
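
The quantities the abstract reports (WER, root mean square error, Pearson correlation coefficient, and real-time factor) can be made concrete with a short sketch. Everything here is an assumption for illustration: the function names, the whitespace tokenisation, and the toy values are hypothetical and do not reproduce the paper's evaluation setup.

```python
import math
from typing import Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def rmse(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Root mean square error between estimated and reference WERs."""
    return math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by audio duration; lower means faster."""
    return processing_seconds / audio_seconds


# Toy values, invented for illustration only.
toy_wer = wer("the cat sat on the mat", "the cat sit on mat")  # 2 errors / 6 words
reference = [0.00, 0.10, 0.20, 0.30]  # true per-utterance WERs
estimate = [0.02, 0.08, 0.22, 0.28]   # estimator outputs
toy_rmse = rmse(estimate, reference)
toy_pcc = pearson(estimate, reference)
toy_rtf = real_time_factor(2.0, 100.0)  # 2 s of compute for 100 s of audio
```

A lower RMSE and a higher Pearson correlation both indicate a better estimator, which is why the abstract reports relative improvements on each metric separately.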

arXiv information

Authors	Chanho Park, Chengsong Lu, Mingjie Chen, Thomas Hain
Published	2023-10-12 11:17:40+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
