Contextualized Automatic Speech Recognition with Dynamic Vocabulary

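Summary

Deep biasing (DB) uses a bias list to improve the performance of end-to-end automatic speech recognition (E2E-ASR) models on rare words and contextual phrases.
However, most existing methods treat a bias phrase as a sequence of subwords drawn from a predefined static vocabulary.
This naive sequence decomposition produces unnatural token patterns, which significantly lowers their occurrence probability.
More advanced techniques address the problem by expanding the vocabulary with additional modules, such as shallow fusion with an external language model or rescoring.
However, the additional modules increase the workload.
This paper proposes a dynamic vocabulary in which bias tokens can be added during inference.
Each entry in the bias list is represented as a single token rather than as a sequence of existing subword tokens.
This approach removes the need to learn subword dependencies within the bias phrases.
Because the method only expands the embedding and output layers of common E2E-ASR architectures, it is easy to apply to a variety of architectures.
Experimental results show that, compared with the conventional DB method, the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 to 4.9 points.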
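
The idea of appending one token per bias phrase and expanding the output layer can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `<phrase>` token naming, the toy weights, and the dot-product scoring of bias tokens are all assumptions made for illustration. The summary only states that the embedding and output layers are expanded.

```python
import math


def build_dynamic_vocab(static_vocab, bias_list):
    """Append one new token per bias phrase after the static vocabulary.

    The "<phrase>" naming is a hypothetical convention for this sketch.
    """
    return list(static_vocab) + [f"<{phrase}>" for phrase in bias_list]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expanded_output(hidden, static_weights, bias_embeddings):
    """Score static tokens and bias tokens, then normalize them jointly.

    Assumption: each bias token is scored by a dot product between the
    decoder hidden state and an embedding of the whole bias phrase.
    """
    static_scores = [dot(hidden, w) for w in static_weights]
    bias_scores = [dot(hidden, e) for e in bias_embeddings]
    return softmax(static_scores + bias_scores)


# Toy usage: two static tokens plus one bias phrase added at inference time.
vocab = build_dynamic_vocab(["a", "b"], ["Sudo"])
probs = expanded_output(
    hidden=[1.0, 0.0],
    static_weights=[[0.0, 0.0], [0.5, 0.0]],
    bias_embeddings=[[2.0, 0.0]],
)
best_token = vocab[probs.index(max(probs))]
```

In this toy setup the bias phrase competes with the static tokens as a single output unit, so the model does not have to assign probability to each of its subwords in turn.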

Abstract (original)

Deep biasing (DB) enhances the performance of end-to-end automatic speech recognition (E2E-ASR) models for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary. This naive sequence decomposition produces unnatural token patterns, significantly lowering their occurrence probability. More advanced techniques address this problem by expanding the vocabulary with additional modules, including the external language model shallow fusion or rescoring. However, they result in increasing the workload due to the additional modules. This paper proposes a dynamic vocabulary where bias tokens can be added during inference. Each entry in a bias list is represented as a single token, unlike a sequence of existing subword tokens. This approach eliminates the need to learn subword dependencies within the bias phrases. This method is easily applied to various architectures because it only expands the embedding and output layers in common E2E-ASR architectures. Experimental results demonstrate that the proposed method improves the bias phrase WER on English and Japanese datasets by 3.1 to 4.9 points compared with the conventional DB method.

arXiv information

Authors	Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe
Published	2024-08-30 07:43:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
