Speech-Text Dialog Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

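Abstract

Speech-text pre-training methods have recently achieved remarkable success on many speech and natural language processing tasks.
However, most existing pre-trained models are tailored to one or two specific tasks and do not carry over to a wide range of speech-text tasks.
Moreover, existing speech-text pre-training methods do not exploit the contextual information within a dialog to enrich utterance representations.
This paper proposes Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which the authors describe as the first speech-text dialog pre-training model.
Specifically, to account for the temporal nature of the speech modality, the authors design a novel temporal position prediction task that captures speech-text alignment.
This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform.
In addition, to learn the characteristics of spoken dialogs, they generalize the response selection task from textual dialog pre-training to the speech-text dialog pre-training setting.
Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.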
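
The temporal position prediction task described above can be illustrated with a small sketch. The abstract only states that the model predicts each word's start and end time in the waveform; everything else here is an assumption made for illustration: the `AlignedWord` type, the normalization of times by utterance duration, and the L1 loss are hypothetical choices, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AlignedWord:
    """A word with its time span in the audio (hypothetical alignment output)."""
    text: str
    start_s: float  # word onset, in seconds
    end_s: float    # word offset, in seconds


def temporal_targets(words: List[AlignedWord], duration_s: float) -> List[Tuple[float, float]]:
    """Build one (start, end) regression target per word.

    Assumption: times are normalized to [0, 1] by the utterance duration.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return [(w.start_s / duration_s, w.end_s / duration_s) for w in words]


def l1_alignment_loss(predicted: List[Tuple[float, float]],
                      targets: List[Tuple[float, float]]) -> float:
    """Mean absolute error over all predicted start and end positions.

    Assumption: an L1 objective; the abstract does not name the loss.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        return 0.0
    total = sum(abs(ps - ts) + abs(pe - te)
                for (ps, pe), (ts, te) in zip(predicted, targets))
    return total / (2 * len(targets))


words = [AlignedWord("hello", 0.0, 0.5), AlignedWord("world", 0.5, 1.0)]
targets = temporal_targets(words, duration_s=2.0)
loss = l1_alignment_loss([(0.0, 0.25), (0.25, 1.0)], targets)
```

In a real pre-training setup the predictions would come from a model head over the text token representations, and the word-level timestamps would come from some alignment procedure; neither is specified in the abstract.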

Abstract (original)

Recently, speech-text pre-training methods have shown remarkable success in many speech and natural language processing tasks. However, most previous pre-trained models are usually tailored for one or two specific tasks, but fail to conquer a wide range of speech-text tasks. In addition, existing speech-text pre-training methods fail to explore the contextual information within a dialogue to enrich utterance representations. In this paper, we propose Speech-text dialog Pre-training for spoken dialog understanding with ExpliCiT cRoss-Modal Alignment (SPECTRA), which is the first-ever speech-text dialog pre-training model. Concretely, to consider the temporality of speech modality, we design a novel temporal position prediction task to capture the speech-text alignment. This pre-training task aims to predict the start and end time of each textual word in the corresponding speech waveform. In addition, to learn the characteristics of spoken dialogs, we generalize a response selection task from textual dialog pre-training to speech-text dialog pre-training scenarios. Experimental results on four different downstream speech-text tasks demonstrate the superiority of SPECTRA in learning speech-text alignment and multi-turn dialog context.

arXiv information

Authors	Tianshu Yu, Haoyu Gao, Ting-En Lin, Min Yang, Yuchuan Wu, Wentao Ma, Chao Wang, Fei Huang, Yongbin Li
Published	2023-06-09 03:48:42+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
