Let’s Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Abstract

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at https://multidialog.github.io and https://huggingface.co/datasets/IVLLab/MultiDialog, respectively.

arXiv information

Authors	Se Jin Park, Chae Won Kim, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeong Hun Yeo, Yong Man Ro
Published	2024-08-02 15:05:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
