Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

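Summary

Acoustic word embeddings (AWEs) are vector representations of spoken words.
An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE).
Until now, the CAE method has been paired with traditional MFCC features.
Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 outperform MFCCs on many downstream tasks.
However, they have not been well studied in the context of learning AWEs.
This work investigates the effectiveness of CAE with SSL-based speech representations for obtaining improved AWEs.
In addition, the capabilities of SSL-based speech models for obtaining AWEs are explored in cross-lingual scenarios.
Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English.
The HuBERT-based CAE model achieves the best word discrimination results in all languages, even though HuBERT is pre-trained on English only.
The HuBERT-based CAE model also works well in cross-lingual settings.
When trained on one source language and tested on target languages, it outperforms MFCC-based CAE models trained on the target languages.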

Summary (original)

Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features. Representations obtained from self-supervised learning (SSL)-based speech models such as HuBERT and Wav2vec2 are outperforming MFCC in many downstream tasks. However, they have not been well studied in the context of learning AWEs. This work explores the effectiveness of CAE with SSL-based speech representations to obtain improved AWEs. Additionally, the capabilities of SSL-based speech models are explored in cross-lingual scenarios for obtaining AWEs. Experiments are conducted on five languages: Polish, Portuguese, Spanish, French, and English. The HuBERT-based CAE model achieves the best results for word discrimination in all languages, despite HuBERT being pre-trained on English only. Also, the HuBERT-based CAE model works well in cross-lingual settings. It outperforms MFCC-based CAE models trained on the target languages when trained on one source language and tested on target languages.

arXiv information

Authors	Amit Meghanani, Thomas Hain
Published	2024-03-13 17:42:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
