Updated Corpora and Benchmarks for Long-Form Speech Recognition

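Summary

Most ASR research uses corpora in which both the training and test data have been pre-segmented into utterances.
In most real-world ASR use cases, however, the test audio is not segmented, which creates a mismatch between inference-time conditions and models trained on segmented utterances.
This paper re-releases three standard ASR corpora (TED-LIUM 3, GigaSpeech, and VoxPopuli-en) with updated transcriptions and alignments so that they can be used for long-form ASR research.
Using these reconstituted corpora, the authors study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), and confirm that AEDs are more susceptible to it.
Finally, they benchmark a simple long-form training approach for these models and show that it improves model robustness under this domain shift.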
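
To make the difference between segmented and long-form data concrete, here is a minimal sketch of one way to build longer training examples from utterance-level segments. The summary does not describe how the paper constructs its long-form training data, so this is an illustrative assumption rather than the authors' procedure: the `Segment` type, the `merge_segments` helper, and the 30-second cap are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    """One transcribed span of a recording (hypothetical type for illustration)."""
    recording_id: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    text: str


def merge_segments(segments: List[Segment], max_duration: float = 30.0) -> List[Segment]:
    """Greedily merge consecutive segments of the same recording.

    A segment is appended to the previous merged example as long as the
    combined span (including any silence between them) stays within
    max_duration. A single segment longer than max_duration is kept as is.
    """
    merged: List[Segment] = []
    ordered = sorted(segments, key=lambda s: (s.recording_id, s.start))
    for seg in ordered:
        if (
            merged
            and merged[-1].recording_id == seg.recording_id
            and seg.end - merged[-1].start <= max_duration
        ):
            last = merged[-1]
            merged[-1] = Segment(
                last.recording_id, last.start, seg.end, f"{last.text} {seg.text}"
            )
        else:
            merged.append(seg)
    return merged
```

With a cap of 30 seconds, two adjacent segments covering 0-5 s and 5-12 s of the same recording become one 12-second example, while a following 12-40 s segment starts a new example because merging it would exceed the cap.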

Abstract (original)

The vast majority of ASR research uses corpora in which both the training and test data have been pre-segmented into utterances. In most real-world ASR use-cases, however, test audio is not segmented, leading to a mismatch between inference-time conditions and models trained on segmented utterances. In this paper, we re-release three standard ASR corpora – TED-LIUM 3, GigaSpeech, and VoxPopuli-en – with updated transcription and alignments to enable their use for long-form ASR research. We use these reconstituted corpora to study the train-test mismatch problem for transducers and attention-based encoder-decoders (AEDs), confirming that AEDs are more susceptible to this issue. Finally, we benchmark a simple long-form training for these models, showing its efficacy for model robustness under this domain shift.

arXiv information

Authors	Jennifer Drexler Fox, Desh Raj, Natalie Delworth, Quinn McNamara, Corey Miller, Migüel Jetté
Published	2023-09-26 15:32:09+00:00
arXiv site	arxiv_id (pdf)

Source, services used

arxiv.jp, Google
