LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Abstract

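Models such as GPT-4o enable real-time spoken interaction with large language models (LLMs), significantly improving the user experience compared with traditional text-based interaction.
However, how to build speech interaction models on top of open-source LLMs remains underexplored.
To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency, high-quality speech interaction with LLMs.
LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder.
It eliminates the need for speech transcription and can generate text and speech responses simultaneously, directly from speech instructions, with extremely low latency.
We build our model on the latest Llama-3.1-8B-Instruct model.
To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which contains 200K speech instructions and the corresponding speech responses.
Experimental results show that, compared with previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226 ms.
In addition, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of future speech-language models.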

Abstract (original)

Models like GPT-4o enable real-time interaction with large language models (LLMs) through speech, significantly enhancing user experience compared to traditional text-based interaction. However, there is still a lack of exploration on how to build speech interaction models based on open-source LLMs. To address this, we propose LLaMA-Omni, a novel model architecture designed for low-latency and high-quality speech interaction with LLMs. LLaMA-Omni integrates a pretrained speech encoder, a speech adaptor, an LLM, and a streaming speech decoder. It eliminates the need for speech transcription, and can simultaneously generate text and speech responses directly from speech instructions with extremely low latency. We build our model based on the latest Llama-3.1-8B-Instruct model. To align the model with speech interaction scenarios, we construct a dataset named InstructS2S-200K, which includes 200K speech instructions and corresponding speech responses. Experimental results show that compared to previous speech-language models, LLaMA-Omni provides better responses in both content and style, with a response latency as low as 226ms. Additionally, training LLaMA-Omni takes less than 3 days on just 4 GPUs, paving the way for the efficient development of speech-language models in the future.

arXiv information

Authors	Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng
Published	2024-09-10 17:34:34+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
