Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction

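Summary

We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation.
It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities.
Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz.
This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information.
To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics.
To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling.
Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency.
The proposed model demonstrates superior performance in real-time spoken dialogue and exhibits strong question-answering abilities.
Our code, model, and training data are available at https://github.com/baichuan-inc/Baichuan-Audio.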

Summary (original)

We introduce Baichuan-Audio, an end-to-end audio large language model that seamlessly integrates audio understanding and generation. It features a text-guided aligned speech generation mechanism, enabling real-time speech interaction with both comprehension and generation capabilities. Baichuan-Audio leverages a pre-trained ASR model, followed by multi-codebook discretization of speech at a frame rate of 12.5 Hz. This multi-codebook setup ensures that speech tokens retain both semantic and acoustic information. To further enhance modeling, an independent audio head is employed to process audio tokens, effectively capturing their unique characteristics. To mitigate the loss of intelligence during pre-training and preserve the original capabilities of the LLM, we propose a two-stage pre-training strategy that maintains language understanding while enhancing audio modeling. Following alignment, the model excels in real-time speech-based conversation and exhibits outstanding question-answering capabilities, demonstrating its versatility and efficiency. The proposed model demonstrates superior performance in real-time spoken dialogue and exhibits strong question-answering abilities. Our code, model and training data are available at https://github.com/baichuan-inc/Baichuan-Audio
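
As a rough illustration of what the 12.5 Hz frame rate implies for sequence length, the sketch below counts the discrete audio tokens produced for a clip. Only the frame rate comes from the abstract. The function name `audio_token_count`, the choice to round a partial frame up, and the codebook count of 8 in the usage line are assumptions made for illustration, since the abstract does not say how many codebooks are used.

```python
import math

FRAME_RATE_HZ = 12.5  # frame rate stated in the abstract


def audio_token_count(duration_s: float, num_codebooks: int) -> int:
    """Return frames x codebooks for a clip of the given duration.

    Assumption: a trailing partial frame is rounded up to a full frame.
    """
    if duration_s < 0 or num_codebooks < 1:
        raise ValueError("duration must be >= 0 and num_codebooks >= 1")
    frames = math.ceil(duration_s * FRAME_RATE_HZ)
    return frames * num_codebooks


# 10 s of speech gives 125 frames; 8 codebooks is a hypothetical value.
print(audio_token_count(10.0, num_codebooks=8))  # 1000
```

At 12.5 Hz each frame covers 80 ms of audio, so the sequence the language model sees grows slowly with clip length, while the number of codebooks per frame determines how much semantic and acoustic detail each frame can carry.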

arXiv information

Authors	Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, Weipeng Chen
Published	2025-02-24 15:16:34+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
