Retrieval-Augmented Text-to-Audio Generation

Summary

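Despite recent progress in text-to-audio (TTA) generation, state-of-the-art models such as AudioLDM, when trained on datasets with an imbalanced class distribution such as AudioCaps, show biased generation performance.
Specifically, they excel at generating common audio classes but underperform on rare ones, which degrades overall generation performance.
We refer to this problem as long-tailed text-to-audio generation.
To address it, we propose a simple retrieval-augmented approach for TTA models.
Given an input text prompt, we first use a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs.
The features of the retrieved audio-text data are then used as additional conditions to guide the learning of the TTA model.
We enhance AudioLDM with the proposed approach and call the resulting system Re-AudioLDM.
On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming existing approaches by a large margin.
Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential for TTA tasks.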
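
The retrieval step described above can be illustrated with a minimal sketch. It assumes the text and audio embeddings have already been computed by a CLAP-style encoder; here they are replaced by small hand-written vectors. The function names (`retrieve_top_k`, `build_condition`), the use of cosine similarity, and the concatenation of retrieved features are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve_top_k(query_emb, db_text_embs, k=3):
    """Return indices and cosine scores of the k captions closest to the query.

    query_emb:    (d,) embedding of the input prompt (assumed precomputed).
    db_text_embs: (n, d) embeddings of the database captions (assumed precomputed).
    """
    q = l2_normalize(np.asarray(query_emb, dtype=np.float64))
    db = l2_normalize(np.asarray(db_text_embs, dtype=np.float64))
    scores = db @ q                      # cosine similarity per database entry
    k = min(k, scores.shape[0])
    top = np.argsort(-scores)[:k]        # highest similarity first
    return top, scores[top]


def build_condition(query_emb, db_text_embs, db_audio_embs, k=3):
    """Stack the text and audio features of the retrieved pairs.

    Concatenation is an assumption made for illustration; the source only
    says the retrieved features serve as additional conditions.
    """
    idx, _ = retrieve_top_k(query_emb, db_text_embs, k)
    condition = np.concatenate([db_text_embs[idx], db_audio_embs[idx]], axis=0)
    return idx, condition


# Toy stand-ins for CLAP embeddings (hypothetical values, d = 3).
db_text = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0],
                    [0.9, 0.1, 0.0],
                    [0.0, 0.0, 1.0]])
db_audio = np.array([[0.8, 0.2, 0.0],
                     [0.1, 0.9, 0.0],
                     [0.7, 0.3, 0.0],
                     [0.0, 0.1, 0.9]])
query = np.array([1.0, 0.0, 0.0])

idx, condition = build_condition(query, db_text, db_audio, k=2)
print(idx, condition.shape)
```

In a real system the condition array would be passed to the generative model alongside the prompt embedding; how Re-AudioLDM fuses these signals is not specified in this summary.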

Summary (original)

Despite recent progress in text-to-audio (TTA) generation, we show that the state-of-the-art models, such as AudioLDM, trained on datasets with an imbalanced class distribution, such as AudioCaps, are biased in their generation performance. Specifically, they excel in generating common audio classes while underperforming in the rare ones, thus degrading the overall generation performance. We refer to this problem as long-tailed text-to-audio generation. To address this issue, we propose a simple retrieval-augmented approach for TTA models. Specifically, given an input text prompt, we first leverage a Contrastive Language Audio Pretraining (CLAP) model to retrieve relevant text-audio pairs. The features of the retrieved audio-text data are then used as additional conditions to guide the learning of TTA models. We enhance AudioLDM with our proposed approach and denote the resulting augmented system as Re-AudioLDM. On the AudioCaps dataset, Re-AudioLDM achieves a state-of-the-art Frechet Audio Distance (FAD) of 1.37, outperforming the existing approaches by a large margin. Furthermore, we show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types, indicating its potential in TTA tasks.

arXiv information

Authors	Yi Yuan, Haohe Liu, Xubo Liu, Qiushi Huang, Mark D. Plumbley, Wenwu Wang
Published	2024-01-05 14:10:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
