GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

Abstract

We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It supports both Chinese and English, engages in real-time voice conversations, and varies vocal nuances such as emotion, intonation, speech rate, and dialect according to user instructions. GLM-4-Voice uses an ultra-low bitrate (175bps), single-codebook speech tokenizer with 12.5Hz frame rate derived from an automatic speech recognition (ASR) model by incorporating a vector-quantized bottleneck into the encoder. To efficiently transfer knowledge from text to speech modalities, we synthesize speech-text interleaved data from existing text pre-training corpora using a text-to-token model. We continue pre-training from the pre-trained text language model GLM-4-9B with a combination of unsupervised speech data, interleaved speech-text data, and supervised speech-text data, scaling up to 1 trillion tokens, achieving state-of-the-art performance in both speech language modeling and spoken question answering. We then fine-tune the pre-trained model with high-quality conversational speech data, achieving superior performance compared to existing baselines in both conversational ability and speech quality. The open models can be accessed through https://github.com/THUDM/GLM-4-Voice and https://huggingface.co/THUDM/glm-4-voice-9b.
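
The tokenizer figures quoted in the abstract (175 bps, 12.5 Hz, single codebook) can be related by simple arithmetic. The sketch below assumes that exactly one token is emitted per frame and that the bitrate counts log2(codebook size) bits per token. Neither assumption is stated in the abstract, and the function names are hypothetical, chosen only for illustration.

```python
import math


def bits_per_token(bitrate_bps: float, frame_rate_hz: float) -> float:
    """Bits carried by each token, assuming one token per frame."""
    return bitrate_bps / frame_rate_hz


def implied_codebook_size(bitrate_bps: float, frame_rate_hz: float) -> int:
    """Codebook size implied by the bitrate, assuming a single codebook
    whose entries are coded with a fixed log2(size) bits each."""
    return round(2 ** bits_per_token(bitrate_bps, frame_rate_hz))


def tokens_for_duration(frame_rate_hz: float, seconds: float) -> int:
    """Number of speech tokens needed to cover `seconds` of audio."""
    return math.ceil(frame_rate_hz * seconds)


if __name__ == "__main__":
    # Figures taken from the abstract: 175 bps at a 12.5 Hz frame rate.
    print(bits_per_token(175, 12.5))
    print(implied_codebook_size(175, 12.5))
    print(tokens_for_duration(12.5, 10))
```

Under these assumptions, each token carries 14 bits, which corresponds to a codebook of 2^14 entries, and ten seconds of speech occupies 125 tokens. This short sequence length is what makes it practical to interleave speech tokens with text tokens in a single language-model context.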

arXiv information

Authors	Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, Jie Tang
Published	2024-12-03 17:41:24+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
