Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition

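Summary

Recent large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) report state-of-the-art performance on a range of benchmarks.
These systems intrinsically learn to handle and remove noise from speech.
Previous work has shown that the denoising capabilities of these models can be extracted into a preprocessor network and used as a frontend for downstream ASR models.
However, the proposed methods were limited to specific fully convolutional architectures.
In this work, we propose a novel method for extracting these denoising capabilities that can be applied to any encoder-decoder architecture.
We propose the Cleancoder preprocessor architecture, which extracts hidden activations from a Conformer ASR model and feeds them to a decoder that predicts denoised spectrograms.
We train the preprocessor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs.
We then evaluate the model as a frontend to a pretrained Conformer ASR model and as a frontend for training smaller Conformer ASR models from scratch.
We show that the Cleancoder can filter noise from speech and that, in both applications, it improves the total Word Error Rate (WER) of the downstream model in noisy conditions.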
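
The data flow described in the summary (a frozen pretrained encoder, hidden activations tapped from its layers, and a separately trained decoder that predicts a clean spectrogram) can be illustrated with a toy sketch. This is not the authors' implementation: the encoder below is a stack of fixed random tanh layers standing in for a Conformer, the decoder is a plain linear map fitted by least squares, and all names, sizes, and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration (not taken from the paper).
N_MELS, HIDDEN, N_LAYERS = 8, 16, 3

# Stand-in for a pretrained, frozen ASR encoder (NOT a real Conformer):
# a stack of fixed random tanh layers whose weights are never updated.
enc_weights = [
    rng.normal(scale=0.5, size=(N_MELS if i == 0 else HIDDEN, HIDDEN))
    for i in range(N_LAYERS)
]

def encoder_activations(spec):
    """Run frames of shape (T, N_MELS) through the frozen encoder and return
    the hidden activations of every layer, concatenated per frame."""
    taps, h = [], spec
    for w in enc_weights:
        h = np.tanh(h @ w)
        taps.append(h)
    return np.concatenate(taps, axis=1)  # (T, N_LAYERS * HIDDEN)

def add_bias(feats):
    """Append a constant column so the linear decoder has a bias term."""
    return np.hstack([feats, np.ones((feats.shape[0], 1))])

def fit_decoder(noisy, clean):
    """Fit a linear decoder from encoder activations to clean frames.
    Only the decoder is trained; the encoder stays untouched."""
    feats = add_bias(encoder_activations(noisy))
    dec, *_ = np.linalg.lstsq(feats, clean, rcond=None)
    return dec

def denoise(noisy, dec):
    """Predict a denoised spectrogram from a noisy one."""
    return add_bias(encoder_activations(noisy)) @ dec

# Synthetic "spectrogram" frames plus additive noise, for demonstration only.
clean = rng.normal(size=(200, N_MELS))
noisy = clean + 0.3 * rng.normal(size=clean.shape)

dec = fit_decoder(noisy, clean)
pred = denoise(noisy, dec)
```

In a realistic setting the decoder would presumably be a neural network trained by gradient descent on paired noisy and clean spectrograms; the least-squares fit here only keeps the sketch short and dependency-light.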

Abstract (original)

In recent research, in the domain of speech processing, large End-to-End (E2E) systems for Automatic Speech Recognition (ASR) have reported state-of-the-art performance on various benchmarks. These systems intrinsically learn how to handle and remove noise conditions from speech. Previous research has shown, that it is possible to extract the denoising capabilities of these models into a preprocessor network, which can be used as a frontend for downstream ASR models. However, the proposed methods were limited to specific fully convolutional architectures. In this work, we propose a novel method to extract the denoising capabilities, that can be applied to any encoder-decoder architecture. We propose the Cleancoder preprocessor architecture that extracts hidden activations from the Conformer ASR model and feeds them to a decoder to predict denoised spectrograms. We train our pre-processor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs. Then, we evaluate our model as a frontend to a pretrained Conformer ASR model as well as a frontend to train smaller Conformer ASR models from scratch. We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions for both applications.
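
The abstract reports results as Word Error Rate (WER). As a reminder of how that metric is commonly computed, here is a minimal sketch: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the paper's exact scoring setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sat on mat")` counts one deletion against six reference words.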

arXiv information

Authors	Patrick Eickhoff, Matthias Möller, Theresa Pekarek Rosin, Johannes Twiefel, Stefan Wermter
Published	2023-09-05 11:34:21+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
