Unified Speech-Text Pretraining for Spoken Dialog Modeling

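Summary

Recent work has shown promising results in extending large language models (LLMs) to directly understand and synthesize speech, but an LLM-based strategy for modeling spoken dialog remains elusive and needs further study.
This work proposes an extensive speech-text LLM framework, the Unified Spoken Dialog Model (USDM), which generates coherent spoken responses with organic prosodic features relevant to the given input speech, without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions.
The approach employs a multi-step speech-text inference scheme that leverages the chain-of-reasoning capabilities of the underlying LLM.
It also proposes a generalized speech-text pretraining scheme that helps capture cross-modal semantics.
Automatic and human evaluations show that the proposed approach is effective at generating natural-sounding spoken responses, outperforming both prior and cascaded baselines.
Detailed comparative studies reveal that, although the cascaded approach is stronger in individual components, joint speech-text modeling improves both robustness against recognition errors and speech quality.
A demo is available at https://unifiedsdm.github.io.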

Summary (original)

While recent work shows promising results in expanding the capabilities of large language models (LLM) to directly understand and synthesize speech, an LLM-based strategy for modeling spoken dialogs remains elusive and calls for further investigation. This work proposes an extensive speech-text LLM framework, named the Unified Spoken Dialog Model (USDM), to generate coherent spoken responses with organic prosodic features relevant to the given input speech without relying on automatic speech recognition (ASR) or text-to-speech (TTS) solutions. Our approach employs a multi-step speech-text inference scheme that leverages chain-of-reasoning capabilities exhibited by the underlying LLM. We also propose a generalized speech-text pretraining scheme that helps with capturing cross-modal semantics. Automatic and human evaluations show that the proposed approach is effective in generating natural-sounding spoken responses, outperforming both prior and cascaded baselines. Detailed comparative studies reveal that, despite the cascaded approach being stronger in individual components, the joint speech-text modeling improves robustness against recognition errors and speech quality. Demo is available at https://unifiedsdm.github.io.

arXiv information

Authors	Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Sungroh Yoon, Kang Min Yoo
Published	2024-02-08 14:35:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
