Listen Again and Choose the Right Answer: A New Paradigm for Automatic Speech Recognition with Large Language Models

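Abstract

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR). GER aims to predict the ground-truth transcription from the decoded N-best hypotheses.
Thanks to the strong language generation ability of LLMs and the rich information in the N-best list, GER is highly effective at improving ASR results.
However, two limitations remain. 1) The LLM has no access to the source speech during GER, so it may produce output that is grammatically correct but contradicts what was actually said. 2) N-best hypotheses usually differ in only a few tokens.
Sending all of them to GER is therefore redundant, and it can confuse the LLM about which tokens to focus on, leading to more miscorrections.
This paper proposes ClozeGER, a new paradigm for ASR generative error correction.
First, a multimodal LLM (i.e., SpeechGPT) is introduced to receive the source speech as an extra input, improving the fidelity of the corrected output.
Then, GER is reformatted as a cloze test with logits calibration, which removes the redundancy in the input and simplifies GER with clear instructions.
Experiments show that ClozeGER improves over vanilla GER on 9 popular ASR datasets.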
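The idea of recasting an N-best list as a cloze test can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_cloze`, the `[BLANK_k]` placeholder format, and the assumption that all hypotheses have the same number of whitespace-separated tokens are simplifications made here for illustration. A real system would need a proper alignment step to handle insertions and deletions.

```python
from typing import List, Tuple


def build_cloze(hypotheses: List[str]) -> Tuple[str, List[List[str]]]:
    """Turn an N-best list into a cloze template plus per-blank options.

    Simplifying assumption (not from the paper): every hypothesis has the
    same number of whitespace-separated tokens, so positions line up.
    """
    token_lists = [h.split() for h in hypotheses]
    length = len(token_lists[0])
    if any(len(tokens) != length for tokens in token_lists):
        raise ValueError("this sketch only handles equal-length hypotheses")

    template: List[str] = []
    options: List[List[str]] = []
    for position in zip(*token_lists):
        # Deduplicate candidates while keeping N-best rank order.
        candidates = list(dict.fromkeys(position))
        if len(candidates) == 1:
            # All hypotheses agree here: keep the token as fixed context.
            template.append(candidates[0])
        else:
            # Hypotheses disagree: replace the token with a numbered blank.
            options.append(candidates)
            template.append(f"[BLANK_{len(options)}]")
    return " ".join(template), options


n_best = [
    "the cat sat on the mat",
    "the cat sat on the map",
    "the cat sad on the mat",
]
template, options = build_cloze(n_best)
print(template)  # the cat [BLANK_1] on the [BLANK_2]
print(options)   # [['sat', 'sad'], ['mat', 'map']]
```

Only the two positions where the hypotheses disagree are exposed to the model as choices; the shared tokens appear once as context, which is the redundancy reduction the abstract describes.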

Abstract (original)

Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets.
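The abstract does not say how the logits calibration is carried out. As a rough illustration only, the sketch below uses a common prior-subtraction scheme as a stand-in: the model's logits for the answer options are corrected by the logits it assigns to the same options on a content-free prompt, then renormalized. The function name `calibrate_option_logits`, the option labels, and the numbers are all hypothetical.

```python
import math
from typing import Dict


def calibrate_option_logits(
    option_logits: Dict[str, float],
    prior_logits: Dict[str, float],
) -> Dict[str, float]:
    """Return calibrated probabilities over the answer options.

    Assumption (not from the paper): calibration subtracts a per-option
    prior logit, e.g. measured on a content-free prompt, before softmax.
    """
    adjusted = {k: option_logits[k] - prior_logits[k] for k in option_logits}
    # Numerically stable softmax over the option set only.
    peak = max(adjusted.values())
    exps = {k: math.exp(v - peak) for k, v in adjusted.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


# Hypothetical logits: the model leans towards option "A" regardless of content.
raw = {"A": 2.0, "B": 1.5}
prior = {"A": 1.0, "B": 0.0}
probs = calibrate_option_logits(raw, prior)
print(max(raw, key=raw.get))      # A
print(max(probs, key=probs.get))  # B
```

In this toy case the uncalibrated choice is "A", but once the label bias is subtracted the choice flips to "B", which is the kind of correction a calibration step is meant to provide.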

arXiv information

Authors	Yuchen Hu, Chen Chen, Chengwei Qin, Qiushi Zhu, Eng Siong Chng, Ruizhe Li
Published	2024-05-16 12:05:45+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
