Evaluation of Speech Representations for MOS prediction

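Summary

This paper evaluates feature extraction models for predicting speech quality.
It also proposes a model architecture that compares embeddings from supervised and self-supervised learning models with embeddings from speaker verification models for predicting the mean opinion score (MOS).
The experiments were run on the VCC2018 dataset and on BRSpeechMOS, a Brazilian Portuguese dataset created for this work.
The results show that the Whisper model is appropriate in every scenario, on both the VCC2018 and BRSpeechMOS datasets.
On BRSpeechMOS, Whisper-Small achieved the best linear correlation among the supervised and self-supervised learning models, 0.6980, while the speaker verification model SpeakerNet reached 0.6963.
On VCC2018, Whisper-Large, the best supervised or self-supervised learning model, achieved a linear correlation of 0.7274, and TitaNet, the best speaker verification model, achieved 0.6933.
Although the speaker verification models score slightly lower, SpeakerNet has only 5M parameters, which makes it suitable for real-time applications, and TitaNet produces embeddings of size 192, the smallest among all evaluated models.
The experimental results can be reproduced with the publicly available source code.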
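
The scores quoted in the summary are linear correlations between predicted and human-rated MOS. The sketch below shows how such a score could be computed, assuming the Pearson coefficient is meant (the abstract says only "linear correlation"). The function name `pearson_correlation` and the sample ratings are made up for illustration and do not come from the paper or its source code.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson linear correlation between two equal-length sequences.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    n = len(predicted)
    if n < 2:
        raise ValueError("need at least two pairs of values")

    mean_p = sum(predicted) / n
    mean_r = sum(reference) / n

    # Sum of co-deviations and sums of squared deviations
    # (the 1/n factors cancel in the final ratio).
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)

    if var_p == 0 or var_r == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_r)


# Made-up ratings on the usual 1-to-5 MOS scale, for illustration only.
human_mos = [1.5, 2.0, 3.0, 4.0, 4.5]
predicted_mos = [2.0, 2.5, 2.5, 3.5, 4.5]

lcc = pearson_correlation(predicted_mos, human_mos)
print(f"linear correlation = {lcc:.4f}")
```

A value near 1 means the predictions rise and fall with the human ratings. The coefficient ignores a constant offset or a rescaling of the predictions, so it is usually read alongside an error measure.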

Summary (original)

In this paper, we evaluate feature extraction models for predicting speech quality. We also propose a model architecture to compare embeddings of supervised learning and self-supervised learning models with embeddings of speaker verification models to predict the metric MOS. Our experiments were performed on the VCC2018 dataset and a Brazilian-Portuguese dataset called BRSpeechMOS, which was created for this work. The results show that the Whisper model is appropriate in all scenarios: with both the VCC2018 and BRSpeechMOS datasets. Among the supervised and self-supervised learning models using BRSpeechMOS, Whisper-Small achieved the best linear correlation of 0.6980, and the speaker verification model, SpeakerNet, had a linear correlation of 0.6963. Using VCC2018, the best supervised and self-supervised learning model, Whisper-Large, achieved a linear correlation of 0.7274, and the best speaker verification model, TitaNet, achieved a linear correlation of 0.6933. Although the results of the speaker verification models are slightly lower, the SpeakerNet model has only 5M parameters, making it suitable for real-time applications, and the TitaNet model produces an embedding of size 192, the smallest among all the evaluated models. The experiment results are reproducible with publicly available source code.

arXiv information

Authors	Frederico S. Oliveira, Edresson Casanova, Arnaldo Cândido Júnior, Lucas R. S. Gris, Anderson S. Soares, Arlindo R. Galvão Filho
Published	2023-06-16 17:21:42+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
