On Data Sampling Strategies for Training Neural Network Speech Separation Models

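Summary

Speech separation remains an important area of multi-speaker signal processing.
Deep neural network (DNN) models achieve the best performance on many speech separation benchmarks.
Some of these models take a long time to train and have high memory requirements.
Previous work has proposed shortening the training examples to address these issues, but the effect of doing so on model performance is not yet well understood.
This work analyses the effect of applying such training signal length (TSL) limits to two speech separation models: SepFormer, a transformer model, and Conv-TasNet, a convolutional model.
The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed in terms of their signal length distributions and the impact of those distributions on training efficiency.
For specific signal length distributions, applying specific TSL limits is shown to improve performance.
This is shown to be mainly because randomly sampling the start index of the waveforms yields more unique training examples.
A SepFormer model trained with a 4.42 s TSL limit and dynamic mixing (DM) is shown to match the best-performing SepFormer model trained with DM and unlimited signal lengths.
In addition, the 4.42 s TSL limit reduces training time on WHAMR by 44%.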

Summary (original)

Speech separation remains an important area of multi-speaker signal processing. Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks. Some of these models can take significant time to train and have high memory requirements. Previous work has proposed shortening training examples to address these issues, but the impact of this on model performance is not yet well understood. In this work, the impact of applying these training signal length (TSL) limits is analysed for two speech separation models: SepFormer, a transformer model, and Conv-TasNet, a convolutional model. The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed in terms of signal length distribution and its impact on training efficiency. It is demonstrated that, for specific distributions, applying specific TSL limits results in better performance. This is shown to be mainly due to randomly sampling the start index of the waveforms, resulting in more unique examples for training. A SepFormer model trained using a TSL limit of 4.42s and dynamic mixing (DM) is shown to match the best-performing SepFormer model trained with DM and unlimited signal lengths. Furthermore, the 4.42s TSL limit results in a 44% reduction in training time with WHAMR.
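
The random start-index sampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_to_tsl`, the 8 kHz sample rate, and the use of plain Python lists for waveforms are assumptions made for illustration only. The sketch crops a mixture and its source signals to at most the TSL limit, drawing one shared start index so that the sources stay aligned with the mixture.

```python
import random


def crop_to_tsl(mixture, sources, tsl_limit_s=4.42, sample_rate=8000, rng=random):
    """Crop a mixture and its sources to at most tsl_limit_s seconds.

    Hypothetical helper: the name, the default sample rate (8 kHz) and the
    list-based waveform representation are assumptions, not from the paper.
    """
    # Maximum number of samples allowed by the TSL limit.
    max_len = int(round(tsl_limit_s * sample_rate))
    n = len(mixture)

    # Signals already within the limit are returned unchanged.
    if n <= max_len:
        return mixture, sources

    # Draw a new start index on every call, so repeated passes over the
    # same utterance can yield different training segments.
    start = rng.randint(0, n - max_len)  # both bounds inclusive
    end = start + max_len

    # The same window is applied to the mixture and to every source.
    cropped_sources = [s[start:end] for s in sources]
    return mixture[start:end], cropped_sources
```

Calling `crop_to_tsl` once per example in each epoch means a long utterance contributes a different segment each time it is visited, which is one way to read the abstract's remark about obtaining more unique training examples.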

arXiv information

Authors	William Ravenscroft, Stefan Goetze, Thomas Hain
Published	2023-06-16 13:42:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
