Whisper Finetuning on Nepali Language

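Summary

Automatic speech recognition (ASR) models continue to advance, but developing robust models for underrepresented languages such as Nepali remains a challenge.
This research focuses on building an exhaustive, generalized dataset and then fine-tuning OpenAI's Whisper models of different sizes to improve transcription (speech-to-text) accuracy for Nepali.
We leverage publicly available ASR datasets together with self-recorded custom datasets that cover a diverse range of accents, dialects, and speaking styles, further enriched through augmentation.
Our experimental results show that fine-tuning Whisper models on the curated custom dataset substantially reduces the word error rate (WER) across all model sizes. This is attributed to greater data variation (speaker age, gender, and sentiment, acoustic environment, and dialect), denser audio segments (15-30 seconds) that are more compatible with Whisper's input, and manual curation of the audio and transcriptions.
Notably, our approach outperforms Whisper's baseline models trained on Fleur's dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model.
Furthermore, we show that data augmentation plays a significant role in enhancing model robustness.
Our approach underlines the importance of dataset quality, variation, and augmentation when adapting state-of-the-art models to underrepresented languages to build accurate ASR systems.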
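
Since the results above are reported as WER, a minimal sketch of how the metric is typically computed may help. It assumes whitespace tokenization and no text normalization; the abstract does not describe the authors' evaluation tooling, and it does not say whether the 36.2% and 23.8% figures are relative or absolute reductions. The function names `word_error_rate` and `relative_wer_reduction` are illustrative, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize punctuation and numerals first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction of the baseline WER removed by fine-tuning.

    Assumption: this is one possible reading of "WER reduction"; an
    absolute reduction would simply be baseline_wer - finetuned_wer.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - finetuned_wer) / baseline_wer
```

For example, `word_error_rate("म घर जान्छु", "म घर जान्छ")` counts one substituted word against three reference words. Because insertions are counted, WER can exceed 1.0 when the hypothesis is much longer than the reference.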

Summary (original)

Despite the growing advancements in Automatic Speech Recognition (ASR) models, the development of robust models for underrepresented languages, such as Nepali, remains a challenge. This research focuses on making an exhaustive and generalized dataset followed by fine-tuning OpenAI’s Whisper models of different sizes to improve transcription (speech-to-text) accuracy for the Nepali language. We leverage publicly available ASR datasets and self-recorded custom datasets with a diverse range of accents, dialects, and speaking styles further enriched through augmentation. Our experimental results demonstrate that fine-tuning Whisper models on our curated custom dataset substantially reduces the Word Error Rate (WER) across all model sizes attributed to larger data variations in terms of speaker’s age, gender, and sentiment, acoustic environment, dialect, denser audio segments (15-30 seconds) that are more compatible with Whisper’s input, and manual curation of audios and transcriptions. Notably, our approach outperforms Whisper’s baseline models trained on Fleur’s dataset, achieving WER reductions of up to 36.2% on the small and 23.8% on medium models. Furthermore, we show that data augmentation plays a significant role in enhancing model robustness. Our approach underlines the importance of dataset quality, variation, and augmentation in the adaptation of state-of-the-art models to underrepresented languages for developing accurate ASR systems.

arXiv information

Authors	Sanjay Rijal, Shital Adhikari, Manish Dahal, Manish Awale, Vaghawan Ojha
Published	2024-11-19 15:55:56+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
