Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

Summary

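This paper describes work on restoring punctuation in transcripts generated by multilingual ASR systems.
The focus languages are English, Mandarin, and Malay, three of the most popular languages in Singapore.
To the best of the authors' knowledge, this is the first system that can handle punctuation restoration for all three languages simultaneously.
Traditional approaches usually treat the task as sequence labeling, whereas this work adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary.
The approach resembles the masked language model objective used when pre-training BERT, except that the model predicts masked punctuation instead of masked words.
In addition, using Jieba rather than relying only on XLM-R's built-in SentencePiece tokenizer significantly improves performance when punctuating Mandarin transcripts.
Experiments on the English and Mandarin IWSLT2022 datasets and on Malay News show that the proposed approach achieves state-of-the-art results for Mandarin, with an F1-score of 73.8%, while maintaining reasonable F1-scores for English and Malay (74.7% and 78%, respectively).
The source code, which allows the results to be reproduced and a simple web-based demonstration application to be built, is available on GitHub.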
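
The slot-filling formulation can be illustrated with a small, self-contained sketch. It is not the authors' code: the label names, the `<mask>` string, and the helper functions `to_slots`, `masked_input`, and `restore` are all hypothetical, and placing one mask after every word is an assumption about how "each word boundary" is realized. The sketch only shows how punctuated text maps to one label per boundary and back.

```python
# Hypothetical label set; the abstract does not list the actual inventory.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}
SYMBOLS = {label: mark for mark, label in PUNCT_LABELS.items()}
MASK = "<mask>"  # assumed mask string, in the style of XLM-R


def to_slots(text):
    """Turn punctuated text into words plus one label per word boundary."""
    words, labels = [], []
    for token in text.split():
        word = token.rstrip(",.?")       # strip trailing punctuation
        mark = token[len(word):][:1]     # first trailing mark, if any
        if not word:
            continue
        words.append(word)
        labels.append(PUNCT_LABELS.get(mark, "O"))  # "O" means no punctuation
    return words, labels


def masked_input(words):
    """Interleave a mask slot after every word (assumed input layout)."""
    out = []
    for word in words:
        out.extend([word, MASK])
    return out


def restore(words, labels):
    """Rebuild punctuated text from words and predicted slot labels."""
    return " ".join(w + SYMBOLS.get(l, "") for w, l in zip(words, labels))
```

In a real system a model would predict the label for each mask slot; here `restore` simply shows how predicted labels would be turned back into text.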

Abstract (original)

This paper presents work on restoring punctuation in ASR transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, which are three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequence labeling task; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked language model objective employed during the pre-training of BERT, but instead of predicting the masked word, our model predicts masked punctuation. Additionally, we find that using Jieba instead of only the built-in SentencePiece tokenizer of XLM-R can significantly improve performance when punctuating Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and on Malay News show that the proposed approach achieves state-of-the-art results for Mandarin, with an F1-score of 73.8%, while maintaining reasonable F1-scores for English and Malay, i.e. 74.7% and 78%, respectively. Our source code, which allows reproducing the results and building a simple web-based application for demonstration purposes, is available on GitHub.
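
For readers unfamiliar with how an F1-score is obtained for this kind of task, the following is a minimal sketch. It assumes micro-averaging over punctuation classes with the "no punctuation" label `O` excluded, which is a common convention but is not confirmed by the abstract; the function name `punctuation_f1` and the toy labels are illustrative only and are not data from the paper.

```python
def punctuation_f1(gold, pred, ignore="O"):
    """Micro-averaged precision, recall and F1 over punctuation slots.

    Assumption: the no-punctuation label is excluded from scoring.
    """
    pairs = list(zip(gold, pred))
    # Correctly predicted punctuation marks.
    tp = sum(1 for g, p in pairs if g == p and g != ignore)
    # Predicted a mark that is absent or of a different type.
    fp = sum(1 for g, p in pairs if p != ignore and g != p)
    # Missed a gold mark or predicted the wrong type for it.
    fn = sum(1 for g, p in pairs if g != ignore and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one correct comma, one spurious comma, one missed period.
gold = ["O", "COMMA", "O", "PERIOD"]
pred = ["O", "COMMA", "COMMA", "O"]
precision, recall, f1 = punctuation_f1(gold, pred)
```

Under this convention a slot predicted with the wrong punctuation type counts as both a false positive and a false negative.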

arXiv information

Authors	Abhinav Rao, Ho Thi-Nga, Chng Eng-Siong
Published	2024-12-02 00:57:32+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
