Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

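Summary

We propose an approach that continuously predicts turn-taking and backchannel locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM).
Experiments on the Switchboard human-human conversation dataset show that our approach consistently outperforms single-modality baseline models.
We also develop a novel multi-task instruction fine-tuning strategy that further exploits the knowledge encoded in the LLM to understand the tasks and the conversational context, leading to additional improvements.
Our approach shows the potential of combining LLMs with acoustic models to enable more natural, conversational interaction between humans and speech-enabled AI agents.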

Abstract (original)

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.

arXiv information

Authors	Jinhan Wang, Long Chen, Aparna Khare, Anirudh Raju, Pranav Dheram, Di He, Minhua Wu, Andreas Stolcke, Venkatesh Ravichandran
Published	2024-01-26 08:59:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
