Text-Driven Foley Sound Generation With Latent Diffusion Model

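Summary

Foley sound generation aims to synthesise background sound for multimedia content.
Previous models typically rely on a large development set with labels as input (e.g., a single number or a one-hot vector).
In this work, we propose a diffusion-model-based system for Foley sound generation with text conditioning.
To alleviate data scarcity, the model is first pre-trained on large-scale datasets and then fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique.
We observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model.
We therefore introduce a trainable layer after the encoder to improve the text embedding it produces.
In addition, we further refine the output by generating multiple candidate audio clips simultaneously and selecting the best one, determined by the similarity score between the embedding of each candidate clip and the embedding of the target text label.
With the proposed methods, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. Ablation studies show that the proposed techniques significantly improve sound generation performance.
The code for implementing the proposed system is available online.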
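
The candidate-selection step described above can be illustrated with a minimal sketch. It assumes that the similarity score is cosine similarity and that each embedding is already available as a plain list of floats; the abstract specifies neither, and the function names `cosine_similarity` and `select_best_candidate` are hypothetical. In the actual system the embeddings would come from audio and text encoders, which are not shown here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def select_best_candidate(
    candidate_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
) -> int:
    """Return the index of the candidate clip closest to the text label."""
    if not candidate_embeddings:
        raise ValueError("at least one candidate is required")
    scores = [cosine_similarity(c, text_embedding) for c in candidate_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings, invented purely for illustration.
text = [1.0, 0.0]
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(select_best_candidate(candidates, text))  # prints 1
```

Generating several candidates and keeping the highest-scoring one costs extra inference time in exchange for output that matches the text label more closely.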

Summary (original)

Foley sound generation aims to synthesise the background sound for multimedia content. Previous models usually employ a large development set with labels as input (e.g., single numbers or one-hot vectors). In this work, we propose a diffusion model based system for Foley sound generation with text conditions. To alleviate the data scarcity issue, our model is initially pre-trained with large-scale datasets and fine-tuned to this task via transfer learning using the contrastive language-audio pretraining (CLAP) technique. We have observed that the feature embedding extracted by the text encoder can significantly affect the performance of the generation model. Hence, we introduce a trainable layer after the encoder to improve the text embedding produced by the encoder. In addition, we further refine the generated waveform by generating multiple candidate audio clips simultaneously and selecting the best one, which is determined in terms of the similarity score between the embedding of the candidate clips and the embedding of the target text label. Using the proposed method, our system ranks 1st among the systems submitted to DCASE Challenge 2023 Task 7. The results of the ablation studies illustrate that the proposed techniques significantly improve sound generation performance. The code for implementing the proposed system is available online.

arXiv information

Authors	Yi Yuan, Haohe Liu, Xubo Liu, Xiyuan Kang, Peipei Wu, Mark D. Plumbley, Wenwu Wang
Published	2023-09-18 10:35:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
