Improving Multilingual Speech Models on ML-SUPERB 2.0: Fine-tuning with Data Augmentation and LID-Aware CTC

Summary

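Multilingual speech processing with self-supervised or supervised pre-trained speech foundation models (SFMs) has achieved strong performance on tasks such as language identification (LID) and automatic speech recognition (ASR). However, these models struggle when fine-tuning resources are limited. This paper improves multilingual LID and ASR on ML-SUPERB 2.0 by exploring several strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. In addition, data augmentation is used to narrow the performance gap in few-shot settings, and an LID Connectionist Temporal Classification (CTC) loss is introduced for regularization. Compared with the ML-SUPERB 2.0 baseline, the approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR CER, and it placed second in the Interspeech 2025 ML-SUPERB 2.0 Challenge.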

Summary (original)

Multilingual speech processing with self-supervised or supervised pre-trained Speech Foundation Models (SFM) has achieved strong performance on tasks like Language Identification (LID) and Automatic Speech Recognition (ASR). However, these models struggle with limited resources during fine-tuning. This paper enhances multilingual LID and ASR on ML-SUPERB 2.0 by exploring multiple strategies for adapting SFMs, including frozen upstream training, partial fine-tuning, and low-rank adaptation. Furthermore, we employ data augmentation to mitigate performance gaps in few-shot settings and introduce LID Connectionist Temporal Classification (CTC) loss for regularization. Our approach achieves a 14% relative improvement in LID accuracy and a 30% relative reduction in ASR CER over the baseline on ML-SUPERB 2.0, securing second place in the Interspeech 2025 ML-SUPERB 2.0 Challenge.
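The headline figures above are relative changes, which are easy to confuse with absolute percentage points. The following is a minimal sketch of how a character error rate (CER) and a relative change are conventionally computed. The function names and the numbers in the usage section are hypothetical and chosen only for illustration; they are not taken from the paper or from the ML-SUPERB 2.0 tooling.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which an error metric (lower is better) drops versus the baseline."""
    return (baseline - new) / baseline


def relative_improvement(baseline: float, new: float) -> float:
    """Fraction by which a score (higher is better) rises versus the baseline."""
    return (new - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical values for illustration only, NOT results from the paper.
    print(cer("abcd", "abxd"))             # one substitution over four characters
    print(relative_reduction(20.0, 14.0))  # CER going from 20.0 to 14.0
    print(relative_improvement(50.0, 57.0))  # accuracy going from 50.0 to 57.0
```

Under these definitions, a 30% relative CER reduction means the new CER is 0.7 times the baseline CER, not that the CER fell by 30 percentage points.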

arXiv information

Authors	Qingzheng Wang, Jiancheng Sun, Yifan Peng, Shinji Watanabe
Published	2025-06-03 15:19:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
