Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Summary

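Language models have made significant progress recently.
GPT-4o marks a new milestone: it enables real-time conversation with humans and demonstrates near-human natural fluency.
Human-computer interaction of this kind requires a model that can reason directly over the audio modality and generate its output as a stream.
Current academic models cannot yet do this, because they typically rely on an additional TTS system for speech synthesis, which introduces undesirable latency.
This paper introduces Mini-Omni, an audio-based end-to-end conversational model capable of real-time speech interaction.
To achieve this capability, we propose a text-instructed speech generation method, together with batch-parallel strategies during inference that further improve performance.
Our method also helps preserve the original model's language capabilities with minimal degradation, so that other work can establish real-time interaction capabilities.
We call this training method "Any Model Can Talk".
We also introduce the VoiceAssistant-400K dataset for fine-tuning models optimized for speech output.
To the best of our knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.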

Abstract (original)

Recent advances in language models have achieved significant progress. GPT-4o, as a new milestone, has enabled real-time conversations with humans, demonstrating near-human natural fluency. Such human-computer interaction necessitates models with the capability to perform reasoning directly with the audio modality and generate output in streaming. However, this remains beyond the reach of current academic models, as they typically depend on extra TTS systems for speech synthesis, resulting in undesirable latency. This paper introduces the Mini-Omni, an audio-based end-to-end conversational model, capable of real-time speech interaction. To achieve this capability, we propose a text-instructed speech generation method, along with batch-parallel strategies during inference to further boost the performance. Our method also helps to retain the original model’s language capabilities with minimal degradation, enabling other works to establish real-time interaction capabilities. We call this training method ‘Any Model Can Talk’. We also introduce the VoiceAssistant-400K dataset to fine-tune models optimized for speech output. To our best knowledge, Mini-Omni is the first fully end-to-end, open-source model for real-time speech interaction, offering valuable potential for future research.

arXiv information

Authors	Zhifei Xie, Changqiao Wu
Published	2024-08-30 02:53:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
