Adaptive Contextual Biasing for Transducer Based Streaming Speech Recognition

Summary

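Deep biasing methods, which incorporate additional contextual information, have emerged as a promising solution for recognizing personalized words in speech.
In real-world voice assistants, however, always biasing toward such personalized words with high prediction scores can significantly degrade recognition of common words.
To address this issue, we propose an adaptive contextual biasing method based on the Context-Aware Transformer Transducer (CATT) that uses the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences.
This prediction is used to switch the bias list on and off dynamically, so the model can adapt to both personalized and common scenarios.
Experiments on Librispeech and internal voice assistant datasets show that our approach achieves up to 6.7% relative WER reduction and 20.7% relative CER reduction, respectively, compared to the baseline, and mitigates up to 96.7% and 84.9% of the relative WER and CER increase in common cases.
In addition, our approach has minimal performance impact in personalized scenarios and keeps a streaming inference pipeline with a negligible RTF increase.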
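
The on/off switching described in the summary can be pictured with a minimal sketch. This is not the authors' implementation: the function name `gate_bias_list`, the per-step probabilities, and the fixed `threshold` are all hypothetical, and the predictor that would produce those probabilities from the biased encoder and predictor embeddings is not modeled here.

```python
from typing import List, Sequence


def gate_bias_list(
    occurrence_probs: Sequence[float],
    bias_list: Sequence[str],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[List[str]]:
    """Return the bias list that would be active at each streaming step.

    occurrence_probs[t] is a hypothetical probability that a contextual
    phrase is being spoken at step t.
    """
    active: List[List[str]] = []
    for prob in occurrence_probs:
        # Bias list on when a contextual phrase is predicted, off otherwise.
        active.append(list(bias_list) if prob >= threshold else [])
    return active


if __name__ == "__main__":
    steps = gate_bias_list([0.1, 0.8, 0.4], ["Lei Xie", "CATT"])
    for t, phrases in enumerate(steps):
        print(t, phrases)
```

With an empty list at a given step, the decoder would behave like an unbiased model for common speech, which is the adaptation the summary describes.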

Abstract (original)

By incorporating additional contextual information, deep biasing methods have emerged as a promising solution for speech recognition of personalized words. However, for real-world voice assistants, always biasing on such personalized words with high prediction scores can significantly degrade the performance of recognizing common words. To address this issue, we propose an adaptive contextual biasing method based on Context-Aware Transformer Transducer (CATT) that utilizes the biased encoder and predictor embeddings to perform streaming prediction of contextual phrase occurrences. Such prediction is then used to dynamically switch the bias list on and off, enabling the model to adapt to both personalized and common scenarios. Experiments on Librispeech and internal voice assistant datasets show that our approach can achieve up to 6.7% and 20.7% relative reduction in WER and CER compared to the baseline respectively, mitigating up to 96.7% and 84.9% of the relative WER and CER increase for common cases. Furthermore, our approach has a minimal performance impact in personalized scenarios while maintaining a streaming inference pipeline with negligible RTF increase.
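
The two kinds of percentages quoted in the abstract can be read as follows. This is a sketch under an assumed interpretation: "relative reduction" is taken as the error-rate drop divided by the baseline error rate, and "mitigated increase" as the share of the degradation caused by always-on biasing that the adaptive method recovers. The function names are hypothetical, and the numbers in the usage lines are made up for illustration, not results from the paper.

```python
def relative_reduction(baseline: float, system: float) -> float:
    """Relative error-rate reduction of `system` over `baseline`, in percent."""
    return (baseline - system) / baseline * 100.0


def mitigated_increase(unbiased: float, always_biased: float, adaptive: float) -> float:
    """Percent of the error increase from always-on biasing that is recovered.

    Assumes always_biased > unbiased, i.e. biasing hurt the common case.
    """
    return (always_biased - adaptive) / (always_biased - unbiased) * 100.0


if __name__ == "__main__":
    # Illustrative error rates only (percent), not taken from the paper.
    print(relative_reduction(baseline=10.0, system=8.0))
    print(mitigated_increase(unbiased=5.0, always_biased=7.0, adaptive=5.5))
```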

arXiv information

Authors	Tianyi Xu, Zhanheng Yang, Kaixun Huang, Pengcheng Guo, Ao Zhang, Biao Li, Changru Chen, Chao Li, Lei Xie
Published	2023-06-01 15:33:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
