Augmenting conformers with structured state-space sequence models for online speech recognition

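Abstract

Online speech recognition, in which the model can access only the context to its left, is an important and challenging use case for ASR systems.
In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that offers a parameter-efficient way to access arbitrarily long left context.
We conducted systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions.
We found that the most effective design is to stack a small S4 that uses real-valued recurrent weights with a local convolution, so that the two can work complementarily.
Our best model achieves WERs of 4.01%/8.53% on the Librispeech test sets, outperforming Conformers with extensively tuned convolutions.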

Abstract (original)

Online speech recognition, where the model only accesses context to the left, is an important and challenging use case for ASR systems. In this work, we investigate augmenting neural encoders for online ASR by incorporating structured state-space sequence models (S4), a family of models that provide a parameter-efficient way of accessing arbitrarily long left context. We performed systematic ablation studies to compare variants of S4 models and propose two novel approaches that combine them with convolutions. We found that the most effective design is to stack a small S4 using real-valued recurrent weights with a local convolution, allowing them to work complementarily. Our best model achieves WERs of 4.01%/8.53% on test sets from Librispeech, outperforming Conformers with extensively tuned convolution.
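To make the idea of stacking a small real-valued S4 with a local convolution concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes a diagonal, real-valued state-space recurrence, a single input channel, and that the convolution is applied after the recurrence (the abstract does not state the order). The names `diagonal_ssm`, `causal_conv1d`, and `s4_then_conv`, and all parameter values, are hypothetical and chosen only for illustration.

```python
from typing import List


def diagonal_ssm(x: List[float], a: List[float], b: List[float],
                 c: List[float]) -> List[float]:
    """Real-valued diagonal state-space recurrence (illustrative assumption).

    h[t] = a * h[t-1] + b * x[t]   (elementwise over state dimensions)
    y[t] = sum(c * h[t])
    The state h summarizes all past frames, so y[t] can depend on
    arbitrarily old inputs while using only len(a) recurrent weights.
    """
    h = [0.0] * len(a)
    y = []
    for x_t in x:
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


def causal_conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Causal local convolution: y[t] = sum_k kernel[k] * x[t - k].

    Only the current and previous len(kernel) - 1 frames are used,
    so no future (right) context leaks into the output.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        y.append(acc)
    return y


def s4_then_conv(x: List[float], a: List[float], b: List[float],
                 c: List[float], kernel: List[float]) -> List[float]:
    """Stack the long-range recurrence with a short causal convolution.

    The ordering (recurrence first, convolution second) is an assumption.
    """
    return causal_conv1d(diagonal_ssm(x, a, b, c), kernel)


if __name__ == "__main__":
    # Impulse input: the recurrence decays geometrically, and the
    # two-tap convolution then mixes each frame with its predecessor.
    print(s4_then_conv([1.0, 0.0, 0.0, 0.0], [0.5], [1.0], [1.0], [1.0, 1.0]))
```

In this sketch the recurrence supplies unbounded left context through its decaying state, while the short kernel captures fine-grained local patterns, which is one way to read the "complementary" behavior described above. Both components are causal, so the stack is usable in an online setting.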

arXiv information

Authors	Haozhe Shan,Albert Gu,Zhong Meng,Weiran Wang,Krzysztof Choromanski,Tara Sainath
Published	2023-12-27 20:01:07+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
