Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining

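Summary

Target-speaker voice activity detection (TS-VAD) is the task of detecting whether speech from a known target speaker is present in an audio frame.
Recently, deep neural network-based models have shown good performance on this task.
However, training these models requires large amounts of labelled data, which is costly and time-consuming to obtain, particularly when generalization to unseen environments is important.
To mitigate this, we propose a causal self-supervised learning (SSL) pretraining framework called Denoising Autoregressive Predictive Coding (DN-APC) to improve TS-VAD performance in noisy conditions.
We also explore various speaker conditioning methods and evaluate their performance under different noise conditions.
Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approximately 2% in both seen and unseen noise.
Additionally, we find that FiLM conditioning provides the best overall performance.
Representation analysis via tSNE plots reveals robust initial representations of speech and non-speech from pretraining.
This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.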

Summary (original)

Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame. Recently, deep neural network-based models have shown good performance in this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal, Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approx. 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via tSNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
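
The abstract names FiLM conditioning as the best-performing way to inject speaker information but does not describe it further. As a rough illustration only, the sketch below assumes the standard feature-wise linear modulation formulation, in which a conditioning vector (here a speaker embedding) is mapped to a per-feature scale `gamma` and shift `beta` that modulate each frame's features. The function names, parameter layout, and toy dimensions are hypothetical and are not taken from the paper or its code.

```python
def linear(weights, bias, vec):
    """Affine map: one output per weight row (hypothetical helper)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(frame_features, speaker_embedding, gamma_params, beta_params):
    """Feature-wise linear modulation of one frame's features.

    gamma_params and beta_params are (weights, bias) pairs that map the
    speaker embedding to a scale and a shift per feature. This layout is
    an assumption made for illustration.
    """
    gamma = linear(*gamma_params, speaker_embedding)
    beta = linear(*beta_params, speaker_embedding)
    # Scale and shift every feature using the speaker-dependent values.
    return [g * h + b for g, h, b in zip(gamma, frame_features, beta)]


# Toy usage: 2 features per frame, 2-dimensional speaker embedding.
gamma_params = ([[1, 0], [0, 1]], [0, 0])  # gamma equals the embedding
beta_params = ([[0, 0], [0, 0]], [1, 1])   # beta is a constant shift of 1
print(film([10, 20], [2, 3], gamma_params, beta_params))
```

In a trained model the two affine maps would be learned jointly with the rest of the network, so that the same frame features are scaled and shifted differently depending on which target speaker is being detected.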

arXiv information

Authors	Holger Severin Bovbjerg, Jan Østergaard, Jesper Jensen, Zheng-Hua Tan
Published	2025-01-06 18:00:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
