FunASR: A Fundamental End-to-End Speech Recognition Toolkit

Abstract

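This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications.
FunASR provides models trained on large-scale industrial corpora, together with the means to deploy them in applications.
The toolkit's flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model trained on a manually annotated Mandarin speech recognition dataset containing 60,000 hours of speech.
To improve Paraformer's performance, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone.
In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both trained on industrial corpora.
These functional modules provide a solid foundation for building high-precision speech recognition services for long audio.
Compared with other models trained on open datasets, Paraformer demonstrates superior performance.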
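
The abstract describes three modules (voice activity detection, recognition, punctuation) as the basis for a long-audio service. One way such modules might be chained is sketched below. This is only an illustration under stated assumptions and does not use FunASR's API: `energy_vad` is a toy energy threshold standing in for FSMN-VAD, and the `recognize` and `punctuate` callables are hypothetical placeholders for Paraformer and CT-Transformer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Segment:
    start: int  # first sample of the speech region (inclusive)
    end: int    # one past the last sample (exclusive)


def energy_vad(samples: Sequence[float], frame_len: int = 160,
               threshold: float = 0.01) -> List[Segment]:
    """Toy stand-in for a VAD model (NOT FSMN-VAD): a frame counts as
    speech when its mean energy exceeds `threshold`; consecutive speech
    frames are merged into one segment."""
    segments: List[Segment] = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy > threshold:
            if start is None:
                start = i          # a speech region begins here
        elif start is not None:
            segments.append(Segment(start, i))  # the region just ended
            start = None
    if start is not None:          # audio ended while still in speech
        segments.append(Segment(start, len(samples)))
    return segments


def transcribe_long_audio(
    samples: Sequence[float],
    recognize: Callable[[Sequence[float]], str],  # hypothetical ASR model
    punctuate: Callable[[str], str],              # hypothetical punctuation model
) -> Tuple[str, List[Segment]]:
    """Split long audio with VAD, recognize each segment, then punctuate."""
    segments = energy_vad(samples)
    pieces = [recognize(samples[s.start:s.end]) for s in segments]
    raw_text = " ".join(p for p in pieces if p)
    return punctuate(raw_text), segments


if __name__ == "__main__":
    # Synthetic audio: silence, speech, silence, speech.
    audio = [0.0] * 320 + [0.5] * 480 + [0.0] * 320 + [0.5] * 160
    text, segs = transcribe_long_audio(
        audio,
        recognize=lambda chunk: f"<{len(chunk)} samples>",  # placeholder model
        punctuate=lambda t: t + ".",                        # placeholder model
    )
    print(segs)
    print(text)
```

Splitting before recognition keeps each model input short, which is the usual motivation for putting a VAD stage in front of a recognizer on long recordings; punctuation is applied last because the recognizer output in this sketch is unpunctuated text.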

Abstract (original)

This paper introduces FunASR, an open-source speech recognition toolkit designed to bridge the gap between academic research and industrial applications. FunASR offers models trained on large-scale industrial corpora and the ability to deploy them in applications. The toolkit’s flagship model, Paraformer, is a non-autoregressive end-to-end speech recognition model that has been trained on a manually annotated Mandarin speech recognition dataset that contains 60,000 hours of speech. To improve the performance of Paraformer, we have added timestamp prediction and hotword customization capabilities to the standard Paraformer backbone. In addition, to facilitate model deployment, we have open-sourced a voice activity detection model based on the Feedforward Sequential Memory Network (FSMN-VAD) and a text post-processing punctuation model based on the controllable time-delay Transformer (CT-Transformer), both of which were trained on industrial corpora. These functional modules provide a solid foundation for building high-precision long audio speech recognition services. Compared to other models trained on open datasets, Paraformer demonstrates superior performance.

arXiv information

Authors	Zhifu Gao, Zerui Li, Jiaming Wang, Haoneng Luo, Xian Shi, Mengzhe Chen, Yabin Li, Lingyun Zuo, Zhihao Du, Zhangyu Xiao, Shiliang Zhang
Published	2023-05-18 14:45:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
