Simul-Whisper: Attention-Guided Streaming Whisper with Truncation Detection

Summary

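Whisper, a robust, large-scale multilingual speech recognition model, has demonstrated strong results in many low-resource and out-of-distribution scenarios.
However, its encoder-decoder structure hinders its application to streaming speech recognition.
This paper introduces Simul-Whisper, which uses the time alignment embedded in Whisper's cross-attention to guide autoregressive decoding and achieves chunk-based streaming ASR without any fine-tuning of the pre-trained model.
The authors also observe that words truncated at chunk boundaries degrade the decoding results, and they propose an integrate-and-fire-based truncation detection model to address this.
Experiments across multiple languages and Whisper architectures show that Simul-Whisper incurs an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, significantly outperforming the current state-of-the-art baseline.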
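
The idea of letting cross-attention guide decoding can be illustrated with a small sketch. It is not the authors' implementation: the stop rule (halt when the newest token's attention peaks within the last few encoder frames of the chunk), the function names `should_stop` and `decode_chunk`, and the `guard_frames` parameter are all assumptions made for illustration, and the attention rows are supplied by hand rather than read from a Whisper model.

```python
import numpy as np


def should_stop(cross_attn_row: np.ndarray, guard_frames: int) -> bool:
    """Return True when the token's attention peaks near the chunk end.

    cross_attn_row: attention weights of one decoded token over the
    encoder frames of the current chunk (hypothetical input).
    guard_frames: how many trailing frames count as "near the end"
    (an assumed tuning parameter, not taken from the paper).
    """
    num_frames = cross_attn_row.shape[0]
    peak = int(np.argmax(cross_attn_row))
    return peak >= num_frames - guard_frames


def decode_chunk(tokens, attn_rows, guard_frames=2):
    """Keep tokens until one attends to the chunk boundary.

    A token whose attention peaks in the guard region is treated as
    unreliable, so it and everything after it are deferred to the
    next chunk (assumed policy).
    """
    kept = []
    for tok, row in zip(tokens, attn_rows):
        if should_stop(np.asarray(row, dtype=float), guard_frames):
            break
        kept.append(tok)
    return kept
```

For example, with five encoder frames and `guard_frames=2`, a token whose attention peaks at frame 4 stops decoding, while tokens peaking at frames 0 and 1 are kept.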

Abstract (original)

As a robust and large-scale multilingual speech recognition model, Whisper has demonstrated impressive results in many low-resource and out-of-distribution scenarios. However, its encoder-decoder structure hinders its application to streaming speech recognition. In this paper, we introduce Simul-Whisper, which uses the time alignment embedded in Whisper’s cross-attention to guide auto-regressive decoding and achieve chunk-based streaming ASR without any fine-tuning of the pre-trained model. Furthermore, we observe the negative effect of the truncated words at the chunk boundaries on the decoding results and propose an integrate-and-fire-based truncation detection model to address this issue. Experiments on multiple languages and Whisper architectures show that Simul-Whisper achieves an average absolute word error rate degradation of only 1.46% at a chunk size of 1 second, which significantly outperforms the current state-of-the-art baseline.
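
The integrate-and-fire mechanism named in the abstract can be sketched generically: per-frame weights are accumulated, and a boundary "fires" each time the running sum reaches a threshold. The abstract does not say how the detector turns this into a truncation decision, so the residual-based rule below, along with the names `integrate_and_fire`, `looks_truncated`, and `residual_cutoff`, is an assumption for illustration. In the paper the weights would come from a trained model; here they are hand-written.

```python
def integrate_and_fire(weights, threshold=1.0):
    """Accumulate per-frame weights and fire when the sum reaches threshold.

    Assumes each weight lies in [0, 1], so at most one firing can
    happen per frame. Returns (fire_frames, residual), where residual
    is the weight left in the accumulator after the last frame.
    """
    acc = 0.0
    fire_frames = []
    for i, w in enumerate(weights):
        acc += w
        if acc >= threshold:
            fire_frames.append(i)
            acc -= threshold
    return fire_frames, acc


def looks_truncated(weights, threshold=1.0, residual_cutoff=0.5):
    """Guess whether the chunk ends in the middle of a word.

    Assumed rule: a large residual means a word has started
    accumulating but has not fired before the chunk boundary.
    """
    _, residual = integrate_and_fire(weights, threshold)
    return residual >= residual_cutoff
```

With weights `[0.5, 0.5, 0.25, 0.75]` the accumulator fires at frames 1 and 3 and ends empty, so the chunk is treated as complete. Appending `[0.5, 0.25]` leaves a residual of 0.75, which this sketch would flag as a truncated final word.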

arXiv information

Authors	Haoyu Wang, Guoqiang Hu, Guodong Lin, Wei-Qiang Zhang, Jian Li
Published	2024-06-14 14:07:26+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
