Diffusion-Based Speech Enhancement with Joint Generative and Predictive Decoders

Summary

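Diffusion-based generative speech enhancement (SE) has recently attracted attention, but reverse diffusion remains time-consuming.
One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system.
However, this pipeline structure does not currently consider using generative and predictive decoders together.
Using a predictive decoder makes it possible to exploit further complementarity between predictive SE and diffusion-based generative SE.
This paper proposes a unified system that jointly uses generative and predictive decoders across two levels.
At the shared encoding level, the encoder encodes both generative and predictive information.
At the decoded feature level, the two features decoded by the generative and predictive decoders are fused.
Specifically, the two SE modules are fused at the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE output to improve convergence, and the final fusion combines the two complementary SE outputs to improve SE performance.
Experiments on the Voice-Bank dataset show that incorporating predictive information yields faster decoding and higher PESQ scores than other score-based diffusion SE systems (StoRM and SGMSE+).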
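
The two fusion points described in the abstract (initializing reverse diffusion from the predictive output, then combining the two outputs at the end) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the networks, the noise schedule, or the fusion rule, so the smoothing-based stand-ins, the linear noise schedule, the weighted average, and every name below (predictive_decoder, toy_reverse_step, enhance, fusion_weight) are assumptions made purely for illustration.

```python
import numpy as np


def smooth(signal, width=5):
    """Moving-average filter used as a crude stand-in for a learned model."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")


def predictive_decoder(noisy):
    """Hypothetical predictive SE: a single deterministic estimate."""
    return smooth(noisy)


def toy_reverse_step(x, noise_scale, rng):
    """Hypothetical reverse diffusion step: drift toward a smoother state,
    then inject noise at the current scale."""
    drift = 0.5 * (smooth(x) - x)
    return x + drift + noise_scale * rng.standard_normal(x.shape)


def enhance(noisy, num_steps=10, fusion_weight=0.5, seed=0):
    """Toy pipeline with an initial and a final fusion (assumed forms)."""
    rng = np.random.default_rng(seed)
    predicted = predictive_decoder(noisy)

    # Initial fusion (assumption): start reverse diffusion from the
    # predictive estimate plus noise instead of from the noisy input.
    noise_scales = np.linspace(0.1, 0.0, num_steps)
    x = predicted + noise_scales[0] * rng.standard_normal(noisy.shape)

    for scale in noise_scales:
        x = toy_reverse_step(x, scale, rng)
    generated = x

    # Final fusion (assumption): weighted average of the two outputs.
    return fusion_weight * generated + (1.0 - fusion_weight) * predicted


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 64))
    noisy = clean + 0.3 * rng.standard_normal(64)
    enhanced = enhance(noisy)
    print(enhanced.shape)
```

With fusion_weight set to 0.0 the sketch returns only the predictive output, and with 1.0 only the generative output, which makes it easy to inspect each branch separately.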

Summary (original)

Diffusion-based generative speech enhancement (SE) has recently received attention, but reverse diffusion remains time-consuming. One solution is to initialize the reverse diffusion process with enhanced features estimated by a predictive SE system. However, the pipeline structure currently does not consider for a combined use of generative and predictive decoders. The predictive decoder allows us to use the further complementarity between predictive and diffusion-based generative SE. In this paper, we propose a unified system that use jointly generative and predictive decoders across two levels. The encoder encodes both generative and predictive information at the shared encoding level. At the decoded feature level, we fuse the two decoded features by generative and predictive decoders. Specifically, the two SE modules are fused in the initial and final diffusion steps: the initial fusion initializes the diffusion process with the predictive SE to improve convergence, and the final fusion combines the two complementary SE outputs to enhance SE performance. Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores compared with other score-based diffusion SE (StoRM and SGMSE+).

arXiv information

Authors	Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji
Published	2024-02-28 12:10:19+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
