Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference

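Summary

Large language models (LLMs) have greatly advanced audio processing by way of audio codecs that convert audio into discrete tokens, which makes it possible to apply language modeling techniques to audio data.
However, audio codecs often operate at high frame rates, which slows training and inference, particularly for autoregressive models.
To address this challenge, we introduce the Low Frame-rate Speech Codec (LFSC), a neural audio codec that uses finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression at a bitrate of 1.89 kbps and 21.5 frames per second.
We show that the new codec makes inference for LLM-based text-to-speech models about three times faster while improving intelligibility and delivering quality comparable to previous models.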
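To make the two headline figures concrete, the short Python sketch below derives the average number of bits per frame from the stated bitrate and frame rate. It assumes 1 kbps = 1000 bits per second; the function names and the 10-second clip are illustrative only and do not come from the paper. The summary does not say how those bits are split across codebooks, so only the average is computed.

```python
# Figures stated in the summary.
BITRATE_KBPS = 1.89
FRAMES_PER_SECOND = 21.5


def bits_per_frame(bitrate_kbps: float, frames_per_second: float) -> float:
    """Average bits spent per frame, assuming 1 kbps = 1000 bits/s."""
    return bitrate_kbps * 1000 / frames_per_second


def frames_for_duration(seconds: float, frames_per_second: float) -> float:
    """Number of codec frames a language model must handle for a clip."""
    return seconds * frames_per_second


lfsc_bits = bits_per_frame(BITRATE_KBPS, FRAMES_PER_SECOND)
clip_frames = frames_for_duration(10, FRAMES_PER_SECOND)  # hypothetical 10 s clip

print(f"{lfsc_bits:.1f} bits per frame")             # 87.9 bits per frame
print(f"{clip_frames:.0f} frames for 10 s of speech")  # 215 frames for 10 s of speech
```

The frame count is the quantity that matters for an autoregressive model: fewer frames per second means fewer sequential steps for the same duration of speech.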

Summary (original)

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.

arXiv information

Authors	Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee
Published	2024-09-18 16:39:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
