Zipformer: A faster and better encoder for automatic speech recognition

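Summary

The Conformer has become the most popular encoder model for automatic speech recognition (ASR).
It adds convolution modules to a transformer so that the model learns both local and global dependencies.
This work describes Zipformer, a transformer that is faster, more memory-efficient, and better-performing.
The modeling changes are: 1) a U-Net-like encoder structure in which the middle stacks operate at lower frame rates;
2) a reorganized block structure with more modules, within which attention weights are re-used for efficiency;
3) BiasNorm, a modified form of LayerNorm that retains some length information;
4) new activation functions, SwooshR and SwooshL, which work better than Swish.
The authors also propose a new optimizer, ScaledAdam, which scales each update by the tensor's current scale so that the relative change stays about the same, and which explicitly learns the parameter scale.
It converges faster and performs better than Adam.
Extensive experiments on the LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of the proposed Zipformer over other state-of-the-art ASR models.
The code is publicly available at https://github.com/k2-fsa/icefall.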
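
The abstract names SwooshR, SwooshL, and BiasNorm but gives no formulas. The Python sketch below uses the definitions as they are understood from the full Zipformer paper: SwooshR(x) = log(1 + exp(x - 1)) - 0.08x - 0.313261687, SwooshL(x) = log(1 + exp(x - 4)) - 0.08x - 0.035, and BiasNorm(x) = x / RMS[x - b] * exp(gamma). These formulas and constants are not stated in the abstract and should be checked against the paper or the icefall code. The function names and the `eps` guard are illustrative assumptions, and the code works on plain lists of floats rather than on tensors.

```python
import math


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def swoosh_r(x):
    # Assumed form of SwooshR; the offset makes swoosh_r(0) approximately 0.
    return softplus(x - 1.0) - 0.08 * x - 0.313261687


def swoosh_l(x):
    # Assumed form of SwooshL; same shape as SwooshR, shifted to the right.
    return softplus(x - 4.0) - 0.08 * x - 0.035


def bias_norm(x, bias, log_scale, eps=1e-8):
    # x and bias hold one frame's channel values; log_scale is a scalar.
    # The RMS is taken over (x - bias), but x itself is what gets divided.
    # eps is an assumption added here only to avoid division by zero.
    mean_sq = sum((xi - bi) ** 2 for xi, bi in zip(x, bias)) / len(x)
    rms = math.sqrt(mean_sq + eps)
    return [xi / rms * math.exp(log_scale) for xi in x]


if __name__ == "__main__":
    print(swoosh_r(0.0), swoosh_l(0.0))
    print(bias_norm([3.0, 4.0], [1.0, 1.0], 0.0))
    print(bias_norm([6.0, 8.0], [1.0, 1.0], 0.0))
```

With a zero bias, doubling the input leaves the output of `bias_norm` unchanged, as in an RMS-style normalization. With a nonzero bias the two outputs differ, which is one way to read the abstract's claim that BiasNorm retains some length information.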

Abstract (original)

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame rates; 2) reorganized block structure with more modules, within which we re-use attention weights for efficiency; 3) a modified form of LayerNorm called BiasNorm allows us to retain some length information; 4) new activation functions SwooshR and SwooshL work better than Swish. We also propose a new optimizer, called ScaledAdam, which scales the update by each tensor’s current scale to keep the relative change about the same, and also explicitly learns the parameter scale. It achieves faster convergence and better performance than Adam. Extensive experiments on LibriSpeech, Aishell-1, and WenetSpeech datasets demonstrate the effectiveness of our proposed Zipformer over other state-of-the-art ASR models. Our code is publicly available at https://github.com/k2-fsa/icefall.
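
The abstract's description of ScaledAdam, scaling the update by each tensor's current scale to keep the relative change about the same, can be illustrated with a simplified sketch. The code below is not the icefall implementation: it is a plain Adam step multiplied by the root-mean-square of the parameter tensor, and it omits the part of ScaledAdam that explicitly learns the parameter scale. The function names, the default hyperparameters, and the `min_rms` floor are assumptions chosen for illustration.

```python
import math


def rms(values):
    # Root-mean-square of a flat list of floats.
    return math.sqrt(sum(v * v for v in values) / len(values))


def scaled_adam_step(param, grad, m, v, step,
                     lr=0.05, beta1=0.9, beta2=0.98,
                     eps=1e-8, min_rms=1e-5):
    """One simplified update for a single tensor stored as a flat list.

    Returns the new (param, m, v). `step` counts from 1.
    """
    # The tensor's current scale; the floor (an assumption) keeps
    # all-zero tensors from getting stuck with a zero step size.
    scale = max(rms(param), min_rms)
    new_param, new_m, new_v = [], [], []
    for p, g, mi, vi in zip(param, grad, m, v):
        # Standard Adam moment estimates with bias correction.
        mi = beta1 * mi + (1.0 - beta1) * g
        vi = beta2 * vi + (1.0 - beta2) * g * g
        m_hat = mi / (1.0 - beta1 ** step)
        v_hat = vi / (1.0 - beta2 ** step)
        # The only change from Adam in this sketch: multiply by `scale`.
        p = p - lr * scale * m_hat / (math.sqrt(v_hat) + eps)
        new_param.append(p)
        new_m.append(mi)
        new_v.append(vi)
    return new_param, new_m, new_v


if __name__ == "__main__":
    for start in ([10.0, -10.0], [0.1, -0.1]):
        updated, _, _ = scaled_adam_step(
            start, [1.0, -1.0], [0.0, 0.0], [0.0, 0.0], step=1)
        print(start, "->", updated)
```

In this sketch a tensor with large values and a tensor with small values both move by roughly the same fraction of their magnitude per step (about `lr`), whereas plain Adam would move both by roughly the same absolute amount.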

arXiv information

Authors	Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey
Published	2024-03-05 13:59:16+00:00

Source and services used

arxiv.jp, Google
