Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

Summary

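Self-supervised learning (SSL) speech models such as wav2vec and HuBERT achieve state-of-the-art performance on automatic speech recognition (ASR) and have proven highly useful when labeled data is scarce.
However, this success has not yet carried over to utterance-level tasks such as speaker, emotion, and language recognition, where supervised fine-tuning of the SSL model is still needed for good performance.
We argue that the cause is the lack of both disentangled representations and an utterance-level learning objective for these tasks.
Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered units to align the SSL features.
The underlying utterance-level representations are then disentangled from the speech content by probabilistic inference on the aligned features.
In addition, the variational lower bound derived from the FA model serves as an utterance-level objective, so error gradients can be backpropagated to the Transformer layers to learn highly discriminative acoustic units.
Combined with HuBERT's masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks of the SUPERB benchmark while using only 20% of the labeled data.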
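
As a rough illustration of "probabilistic inference on the aligned features", the sketch below computes the posterior of a single utterance-level factor under a classical per-unit factor analysis model. It is not taken from the paper: the Gaussian form x_t ~ N(means[k] + loadings[k] w, diag(variances[k])) with prior w ~ N(0, I), the diagonal covariances, and all names (`posterior_utterance_factor`, `means`, `loadings`, `variances`) are assumptions made for illustration. The unit assignments, which the summary says come from clustering, are simply given here.

```python
import numpy as np

def posterior_utterance_factor(feats, unit_ids, means, loadings, variances):
    """Posterior mean and covariance of an utterance factor w.

    Assumed model (illustrative, not the paper's exact formulation):
      x_t | unit k ~ N(means[k] + loadings[k] @ w, diag(variances[k])),  w ~ N(0, I)
    feats: (T, D) frame features; unit_ids: (T,) unit index of each frame.
    """
    n_units, feat_dim, latent_dim = loadings.shape
    precision = np.eye(latent_dim)   # prior precision of w
    linear = np.zeros(latent_dim)
    for k in range(n_units):
        frames = feats[unit_ids == k]
        n_k = frames.shape[0]
        if n_k == 0:
            continue  # unit not observed in this utterance
        # First-order statistic of unit k, centered on the unit mean.
        centered_sum = frames.sum(axis=0) - n_k * means[k]
        # T_k^T Sigma_k^{-1}, shape (latent_dim, feat_dim).
        weighted = loadings[k].T / variances[k]
        precision += n_k * (weighted @ loadings[k])
        linear += weighted @ centered_sum
    cov = np.linalg.inv(precision)
    return cov @ linear, cov

# Synthetic utterance: every frame shares one hidden factor w_true.
rng = np.random.default_rng(0)
n_units, feat_dim, latent_dim, n_frames = 4, 8, 2, 2000
means = rng.normal(size=(n_units, feat_dim))
loadings = rng.normal(size=(n_units, feat_dim, latent_dim))
variances = np.full((n_units, feat_dim), 0.01)
w_true = rng.normal(size=latent_dim)
unit_ids = rng.integers(0, n_units, size=n_frames)
noise = rng.normal(scale=0.1, size=(n_frames, feat_dim))
feats = means[unit_ids] + loadings[unit_ids] @ w_true + noise

w_hat, cov = posterior_utterance_factor(feats, unit_ids, means, loadings, variances)
```

Because the per-unit means absorb what varies with content, the inferred `w_hat` captures only what is shared across all frames of the utterance, which is the intuition behind separating utterance-level information from speech content.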

Abstract (original)

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and proved to be extremely useful in low label-resource settings. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled from the content of speech using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT’s masked prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of labeled data.

arXiv information

Authors	Weiwei Lin, Chenhang He, Man-Wai Mak, Youzhi Tu
Published	2023-10-04 12:15:56+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
