Deferred NAM: Low-latency Top-K Context Injection via Deferred Context Encoding for Non-Streaming ASR

Abstract

Contextual biasing enables speech recognizers to transcribe important phrases in the speaker’s context, such as contact names, even if they are rare in, or absent from, the training data. Attention-based biasing is a leading approach that allows full end-to-end co-training of the recognizer and biasing system and requires no separate inference-time components. Such biasers typically consist of a context encoder; followed by a context filter, which narrows down the context to apply and so improves per-step inference time; and, finally, context application via cross-attention. Though much work has gone into optimizing per-frame performance, the context encoder is at least as important: recognition cannot begin before context encoding ends. Here, we show that the lightweight phrase selection pass can be moved before context encoding, resulting in a speedup of up to 16.1 times and enabling biasing to scale to 20K phrases with a maximum pre-decoding delay under 33 ms. With the addition of phrase- and wordpiece-level cross-entropy losses, our technique also achieves up to a 37.5% relative WER reduction over the baseline without the losses and lightweight phrase selection pass.
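The reordering described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `cheap_embed`, `CountingEncoder`, `select_top_k`, and both pipeline functions are hypothetical names, the hashed-bigram embedding is a stand-in for a real lightweight phrase scorer, and the query vector is built from text here only as a placeholder for whatever audio-derived query a real system would use. The sketch only shows that running selection first cuts the number of expensive encoder calls from the full phrase list to K.

```python
import heapq
import math


def cheap_embed(phrase, dim=8):
    """Hypothetical lightweight embedding: hashed character-bigram counts."""
    vec = [0.0] * dim
    for a, b in zip(phrase, phrase[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    return vec


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


class CountingEncoder:
    """Stand-in for an expensive context encoder; counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def encode(self, phrase):
        self.calls += 1
        vec = cheap_embed(phrase, dim=16)
        norm = math.sqrt(dot(vec, vec)) or 1.0
        return [x / norm for x in vec]


def select_top_k(query, phrases, k):
    """Lightweight selection pass: score every phrase cheaply, keep the best k."""
    scores = [(dot(query, cheap_embed(p)), i) for i, p in enumerate(phrases)]
    return [phrases[i] for _, i in heapq.nlargest(k, scores)]


def encode_then_filter(encoder, query, phrases, k):
    """Conventional order: encode every phrase, then narrow down to k."""
    encoded = {p: encoder.encode(p) for p in phrases}
    return {p: encoded[p] for p in select_top_k(query, phrases, k)}


def filter_then_encode(encoder, query, phrases, k):
    """Deferred order: narrow down to k first, then encode only those."""
    return {p: encoder.encode(p) for p in select_top_k(query, phrases, k)}


# Toy usage: 1000 distinct phrases, keep the top 5.
phrases = [f"contact {i}" for i in range(1000)]
query = cheap_embed("contact 42")  # placeholder for an audio-derived query

baseline, deferred = CountingEncoder(), CountingEncoder()
full = encode_then_filter(baseline, query, phrases, k=5)
fast = filter_then_encode(deferred, query, phrases, k=5)

print(baseline.calls, deferred.calls)  # 1000 5
```

Both orderings return encodings for the same five phrases because they share the same selection pass; only the amount of encoder work done before decoding can start differs.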

arXiv information

Authors	Zelin Wu, Gan Song, Christopher Li, Pat Rondon, Zhong Meng, Xavier Velez, Weiran Wang, Diamantino Caseiro, Golan Pundak, Tsendsuren Munkhdalai, Angad Chandorkar, Rohit Prabhavalkar
Published	2024-04-23 13:43:26+00:00

Source, services used

arxiv.jp, Google
