Globally Normalising the Transducer for Streaming Speech Recognition

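Summary

The Transducer (e.g. the RNN-Transducer or Conformer-Transducer) generates an output label sequence as it traverses the input sequence.
It is straightforward to use in streaming mode, where it produces partial hypotheses before the complete input has been seen.
This makes it popular in speech recognition.
In streaming mode, however, the Transducer has a mathematical flaw which, simply put, restricts the model's ability to change its mind.
The fix is to replace local normalisation (e.g. a softmax) with global normalisation, but then the loss function can no longer be evaluated exactly.
A recent paper proposes solving this by approximating the model, which severely degrades performance.
This paper instead proposes approximating the loss function, so that global normalisation can be applied to a state-of-the-art streaming model.
Global normalisation reduces the word error rate by 9-11% relative and closes almost half the gap between streaming and lookahead modes.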

Abstract (original)

The Transducer (e.g. RNN-Transducer or Conformer-Transducer) generates an output label sequence as it traverses the input sequence. It is straightforward to use in streaming mode, where it generates partial hypotheses before the complete input has been seen. This makes it popular in speech recognition. However, in streaming mode the Transducer has a mathematical flaw which, simply put, restricts the model’s ability to change its mind. The fix is to replace local normalisation (e.g. a softmax) with global normalisation, but then the loss function becomes impossible to evaluate exactly. A recent paper proposes to solve this by approximating the model, severely degrading performance. Instead, this paper proposes to approximate the loss function, allowing global normalisation to apply to a state-of-the-art streaming model. Global normalisation reduces its word error rate by 9-11% relative, closing almost half the gap between streaming and lookahead mode.
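
The difference between local and global normalisation can be illustrated with a toy example. The sketch below is not the paper's model or loss: it ignores blank labels and alignments, uses a two-label vocabulary, and computes the global normaliser by brute-force enumeration, which only works at this tiny scale. The names `LABELS`, `toy_score`, `local_prob`, `global_prob` and `first_label_marginal` are hypothetical and chosen for illustration. The scores are set up so that evidence arriving at the second step argues strongly against having picked "a" at the first step.

```python
import itertools
import math

LABELS = ("a", "b")  # hypothetical two-label vocabulary


def toy_score(t, prev, label):
    """Unnormalised score of `label` at step t, given the previous label.

    Step 0 is undecided. At step 1 the evidence penalises any
    continuation of "a" and rewards any continuation of "b".
    """
    if t == 0:
        return 0.0
    return -3.0 if prev == "a" else 3.0


def path_score(seq, score):
    """Sum of the unnormalised scores along one label sequence."""
    total, prev = 0.0, None
    for t, label in enumerate(seq):
        total += score(t, prev, label)
        prev = label
    return total


def local_prob(seq, score):
    """Local normalisation: a softmax over the next label at every step."""
    prob, prev = 1.0, None
    for t, label in enumerate(seq):
        z = sum(math.exp(score(t, prev, other)) for other in LABELS)
        prob *= math.exp(score(t, prev, label)) / z
        prev = label
    return prob


def global_prob(seq, score):
    """Global normalisation: one normaliser over all complete sequences."""
    all_seqs = itertools.product(LABELS, repeat=len(seq))
    z = sum(math.exp(path_score(s, score)) for s in all_seqs)
    return math.exp(path_score(seq, score)) / z


def first_label_marginal(prob_fn, score, label, length):
    """Total probability of all sequences that start with `label`."""
    return sum(
        prob_fn(s, score)
        for s in itertools.product(LABELS, repeat=length)
        if s[0] == label
    )


if __name__ == "__main__":
    print("local :", first_label_marginal(local_prob, toy_score, "a", 2))
    print("global:", first_label_marginal(global_prob, toy_score, "a", 2))
```

Under local normalisation the probability that the first label is "a" stays at 0.5, because the softmax at the second step renormalises the penalty away. Under global normalisation it drops to roughly 0.0025, so the later evidence does revise the earlier decision. This is one way to read the abstract's remark about the model's ability to change its mind.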

arXiv information

Author	Rogier van Dalen
Published	2023-07-20 16:04:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
