Advancing Multi-talker ASR Performance with Large Language Models

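Summary

Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR).
Serialized output training (SOT) is a classic method for multi-talker ASR: for training, it concatenates the transcriptions of the individual speakers in the order of the emission times of their speech.
However, SOT-style transcriptions, which are obtained by concatenating multiple related utterances in a conversation, depend heavily on modeling long contexts.
Therefore, compared with traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach based on large language models (LLMs), which leverages the capabilities of a pre-trained decoder, may be better suited to such complex and challenging scenarios.
This paper proposes an LLM-based SOT approach for multi-talker ASR that leverages a pre-trained speech encoder and an LLM and fine-tunes them on multi-talker datasets using appropriate strategies.
Experimental results show that the approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming an AED model from previous work that was trained with 1000 times more supervised data.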
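
As a rough illustration of the SOT idea described above, the following minimal Python sketch sorts utterances by start time and joins their transcriptions into a single training target. The `Utterance` class, the `build_sot_target` function, and the `<sc>` separator token are assumptions made for illustration; the abstract does not specify the separator or any implementation details.

```python
from dataclasses import dataclass

# Assumed speaker-change separator; the abstract does not name one.
SPEAKER_CHANGE = "<sc>"


@dataclass
class Utterance:
    """Hypothetical container for one speaker's utterance."""
    speaker: str
    start: float  # start (emission) time in seconds
    text: str


def build_sot_target(utterances, sep=SPEAKER_CHANGE):
    """Concatenate transcriptions in order of start time, SOT-style."""
    ordered = sorted(utterances, key=lambda u: u.start)
    return f" {sep} ".join(u.text for u in ordered)


# Two overlapping utterances given out of order.
example = [
    Utterance("B", 1.2, "i am fine thanks"),
    Utterance("A", 0.0, "hello how are you"),
]
sot_target = build_sot_target(example)
print(sot_target)  # hello how are you <sc> i am fine thanks
```

Because the target is one long string spanning several utterances, the decoder has to model a longer context than in single-speaker ASR, which is the motivation the summary gives for using an LLM as the decoder.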

Summary (original)

Recognizing overlapping speech from multiple speakers in conversational scenarios is one of the most challenging problems for automatic speech recognition (ASR). Serialized output training (SOT) is a classic method to address multi-talker ASR, with the idea of concatenating transcriptions from multiple speakers according to the emission times of their speech for training. However, SOT-style transcriptions, derived from concatenating multiple related utterances in a conversation, depend significantly on modeling long contexts. Therefore, compared to traditional methods that primarily emphasize encoder performance in attention-based encoder-decoder (AED) architectures, a novel approach utilizing large language models (LLMs) that leverages the capabilities of pre-trained decoders may be better suited for such complex and challenging scenarios. In this paper, we propose an LLM-based SOT approach for multi-talker ASR, leveraging a pre-trained speech encoder and LLM and fine-tuning them on multi-talker datasets using appropriate strategies. Experimental results demonstrate that our approach surpasses traditional AED-based methods on the simulated dataset LibriMix and achieves state-of-the-art performance on the evaluation set of the real-world dataset AMI, outperforming the AED model trained with 1000 times more supervised data in previous works.

arXiv information

Authors	Mohan Shi, Zengrui Jin, Yaoxun Xu, Yong Xu, Shi-Xiong Zhang, Kun Wei, Yiwen Shao, Chunlei Zhang, Dong Yu
Published	2024-08-30 17:29:25+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
