AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining

Summary

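Audio generation shares commonalities across audio types such as speech, music, and sound effects, but designing a model for each type requires careful attention to objectives and biases that can differ significantly from those of the other types.
To move toward a unified view of audio generation, this paper proposes a framework that uses the same learning method for speech, music, and sound effect generation.
The framework introduces a general representation of audio called the language of audio (LOA).
Any audio can be translated into LOA using AudioMAE, a self-supervised pre-trained representation learning model.
In the generation process, a GPT-2 model translates any modality into LOA, and a latent diffusion model conditioned on LOA is trained for audio generation in a self-supervised manner.
This design naturally brings advantages such as in-context learning ability and reusable self-supervised pre-trained AudioMAE and latent diffusion models.
Experiments on the major text-to-audio, text-to-music, and text-to-speech benchmarks show performance that is either new state-of-the-art or competitive with previous approaches.
The demo and code are available at https://audioldm.github.io/audioldm2.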

Summary (original)

Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches. Our demo and code are available at https://audioldm.github.io/audioldm2.
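
The data flow in the abstract (audio to LOA via AudioMAE for training, any modality to LOA via GPT-2 for generation, and a latent diffusion model conditioned on LOA in both cases) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in written for this page: the function names, the vector size, and the simple numeric updates are assumptions, and none of it reflects the actual AudioLDM 2 code or model internals.

```python
"""Toy data-flow sketch of the pipeline described in the abstract.

Every function is a hypothetical placeholder, not the AudioLDM 2 implementation.
"""
import random

LOA_DIM = 8  # assumed toy size of one LOA vector


def audiomae_encode(waveform):
    """Stand-in for AudioMAE: summarize audio samples as an LOA vector.

    Here the "representation" is just the mean of each of LOA_DIM chunks.
    """
    chunk = max(1, len(waveform) // LOA_DIM)
    return [
        sum(waveform[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(LOA_DIM)
    ]


def text_to_loa(prompt):
    """Stand-in for the GPT-2 translator: map a text prompt to an LOA vector.

    A seeded random generator makes the mapping deterministic per prompt.
    """
    rng = random.Random(prompt)
    return [rng.uniform(-1.0, 1.0) for _ in range(LOA_DIM)]


def ldm_generate(loa, num_samples=64, steps=10, seed=0):
    """Stand-in for the LOA-conditioned latent diffusion model.

    Starts from Gaussian noise and repeatedly moves each sample halfway
    toward a target signal determined by the conditioning vector.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]
    target = [loa[i * LOA_DIM // num_samples] for i in range(num_samples)]
    for _ in range(steps):
        x = [xi + 0.5 * (ti - xi) for xi, ti in zip(x, target)]
    return x


# Training-time path: the conditioning comes from the audio itself,
# so no labels are needed (the self-supervised part of the design).
reference_audio = [random.Random(1).uniform(-1.0, 1.0) for _ in range(64)]
loa_from_audio = audiomae_encode(reference_audio)
reconstruction = ldm_generate(loa_from_audio)

# Generation-time path: another modality (text here) is translated to LOA,
# and the same conditioned generator is reused unchanged.
loa_from_text = text_to_loa("a dog barking in the rain")
generated_audio = ldm_generate(loa_from_text)
```

The point of the sketch is the interface: because both paths meet at the same LOA vector, the generator trained on audio-derived LOA can be reused for any input that a translator can map into that space.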

arXiv information

Authors	Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, Mark D. Plumbley
Published	2023-08-10 17:55:13+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
