Multimodal Audio-textual Architecture for Robust Spoken Language Understanding

Summary

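Modern voice assistants are usually built on a cascaded spoken language understanding (SLU) pipeline consisting of an automatic speech recognition (ASR) engine followed by a natural language understanding (NLU) system.
Because this approach depends on the ASR output, it often suffers from so-called ASR error propagation.
This work investigates the impact of ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLMs) such as BERT and RoBERTa.
In addition, a multimodal language understanding (MLU) module is proposed to mitigate the SLU performance degradation caused by errors in the ASR transcript.
The MLU benefits from self-supervised features learned from both the audio and text modalities: Wav2Vec for speech and BERT/RoBERTa for language.
It combines an encoder network that embeds the audio signal with a text encoder that processes the transcript, followed by a late fusion layer that fuses the audio and text logits.
The authors find that the proposed MLU is robust to poor-quality ASR transcripts, whereas the performance of BERT and RoBERTa is severely compromised.
The model is evaluated on five tasks from three SLU datasets, and robustness is tested using transcripts from three ASR engines.
The results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the performance of the PLM models across all datasets for the academic ASR engine.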
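
The late fusion step described above can be illustrated with a small sketch. The abstract does not say how the audio and text logits are combined, so the weighted sum below is only an assumption chosen for illustration; the function names, the `audio_weight` parameter, and the example logits are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a sequence of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def fuse_logits(
    audio_logits: Sequence[float],
    text_logits: Sequence[float],
    audio_weight: float = 0.5,
) -> List[float]:
    """Combine per-class logits from two encoders with a weighted sum.

    ASSUMPTION: a fixed convex combination stands in for the paper's
    late fusion layer, whose exact form the abstract does not specify.
    """
    if len(audio_logits) != len(text_logits):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return [
        audio_weight * audio + (1.0 - audio_weight) * text
        for audio, text in zip(audio_logits, text_logits)
    ]


def predict_intent(
    audio_logits: Sequence[float],
    text_logits: Sequence[float],
    audio_weight: float = 0.5,
) -> int:
    """Return the index of the most probable class after fusion."""
    probabilities = softmax(fuse_logits(audio_logits, text_logits, audio_weight))
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical logits over three intent classes. The text branch is
# imagined to have been misled by an ASR error and slightly prefers
# class 1, while the audio branch clearly prefers class 0.
audio_logits = [2.5, 0.3, 0.2]
text_logits = [1.0, 1.2, 0.1]

text_only = predict_intent(audio_logits, text_logits, audio_weight=0.0)
fused = predict_intent(audio_logits, text_logits, audio_weight=0.5)
print(text_only, fused)  # prints "1 0"
```

With `audio_weight=0.0` the prediction follows the text branch alone, which mimics a cascade NLU system; with equal weights the audio evidence overrides the corrupted transcript in this toy case. A trained system would more plausibly learn the fusion parameters than fix them by hand.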

Abstract (original)

Recent voice assistants are usually based on the cascade spoken language understanding (SLU) solution, which consists of an automatic speech recognition (ASR) engine and a natural language understanding (NLU) system. Because such approach relies on the ASR output, it often suffers from the so-called ASR error propagation. In this work, we investigate impacts of this ASR error propagation on state-of-the-art NLU systems based on pre-trained language models (PLM), such as BERT and RoBERTa. Moreover, a multimodal language understanding (MLU) module is proposed to mitigate SLU performance degradation caused by errors present in the ASR transcript. The MLU benefits from self-supervised features learned from both audio and text modalities, specifically Wav2Vec for speech and Bert/RoBERTa for language. Our MLU combines an encoder network to embed the audio signal and a text encoder to process text transcripts followed by a late fusion layer to fuse audio and text logits. We found that the proposed MLU showed to be robust towards poor quality ASR transcripts, while the performance of BERT and RoBERTa are severely compromised. Our model is evaluated on five tasks from three SLU datasets and robustness is tested using ASR transcripts from three ASR engines. Results show that the proposed approach effectively mitigates the ASR error propagation problem, surpassing the PLM models’ performance across all datasets for the academic ASR engine.

arXiv information

Authors	Anderson R. Avila, Mehdi Rezagholizadeh, Chao Xing
Published	2023-06-13 15:41:11+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
