Audio Generation with Multiple Conditional Diffusion Model

Summary

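Text-based audio generation models are limited because text cannot capture all the information in audio, so relying on text alone restricts controllability.
To address this, the authors propose a new model that strengthens the controllability of existing pre-trained text-to-audio models by adding further conditions as supplements to the text: content (timestamps) and style (pitch contour and energy contour).
This approach gives fine-grained control over the temporal order, pitch, and energy of the generated audio.
To preserve generation diversity, the model uses a trainable control condition encoder, enhanced by a large language model, and a trainable Fusion-Net to encode and fuse the additional conditions, while the weights of the pre-trained text-to-audio model are kept frozen.
Because suitable datasets and evaluation metrics are lacking, the authors consolidate existing datasets into a new dataset consisting of audio and the corresponding conditions, and evaluate control performance with a series of evaluation metrics.
The experimental results show that the model achieves fine-grained control and succeeds at controllable audio generation.
Audio samples and the dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/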

Abstract (original)

Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at https://conditionaudiogen.github.io/conditionaudiogen/
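
To make the "energy contour" style condition more concrete, here is a minimal sketch that computes a frame-wise RMS energy curve from a mono waveform. It is an illustration only: the abstract does not say how the authors extract energy, so the RMS definition, the function name `energy_contour`, and the `frame_length` and `hop_length` parameters are all assumptions rather than the paper's method.

```python
import math


def energy_contour(samples, frame_length=1024, hop_length=256):
    """Return one RMS energy value per frame of a mono waveform.

    Hypothetical helper: RMS is one common definition of frame energy,
    assumed here because the abstract does not specify one.
    """
    if frame_length <= 0 or hop_length <= 0:
        raise ValueError("frame_length and hop_length must be positive")
    contour = []
    # Slide a window over the signal; a signal shorter than one frame
    # still yields a single (shorter) frame, and an empty one yields [].
    last_start = max(len(samples) - frame_length, 0)
    for start in range(0, last_start + 1, hop_length):
        frame = samples[start:start + frame_length]
        if not frame:
            break
        mean_square = sum(x * x for x in frame) / len(frame)
        contour.append(math.sqrt(mean_square))
    return contour


# Toy signal: silence followed by a constant full-scale segment.
signal = [0.0] * 8 + [1.0] * 8
print(energy_contour(signal, frame_length=4, hop_length=4))
# prints [0.0, 0.0, 1.0, 1.0]
```

A curve like this, one value per frame, is the kind of time-aligned signal that could be supplied alongside the text prompt so that loudness follows a chosen shape over time.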

arXiv information

Authors	Zhifang Guo, Jianguo Mao, Rui Tao, Long Yan, Kazushige Ouchi, Hong Liu, Xiangdong Wang
Published	2023-12-17 06:01:27+00:00

Source and services used

arxiv.jp, Google
