Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration

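Summary

Training data cleaning is a new application of generative model-based speech restoration (SR).
This paper introduces Miipher-2, an SR model designed to clean million-hour-scale training data for large-scale generative models such as large language models.
The key challenges it addresses are generalization to unseen languages, operation without explicit conditioning (e.g., text or speaker ID), and computational efficiency.
Miipher-2 uses a frozen, pre-trained Universal Speech Model (USM), which supports over 300 languages, as a robust, conditioning-free feature extractor.
To improve efficiency and reduce memory use, Miipher-2 adds parallel adapters that predict clean USM features from noisy inputs, and uses the WaveFit neural vocoder for waveform synthesis.
These components were trained on 3,000 hours of multilingual, studio-quality recordings with augmented degradations, while the USM parameters remained frozen.
Experiments show that Miipher-2 is superior or comparable to conventional SR models in word error rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages.
Miipher-2 runs efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, which makes it possible to process a million-hour speech dataset in approximately three days using only 100 such accelerators.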
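
The throughput claim can be checked with simple arithmetic. The sketch below is only an illustration, not code from the paper: it assumes the real-time factor is defined as processing time divided by audio duration, that the workload splits evenly across accelerators, and that I/O and scheduling overhead are negligible. The function name `processing_days` is hypothetical.

```python
def processing_days(dataset_hours: float, rtf: float, num_accelerators: int) -> float:
    """Estimate wall-clock days needed to process a dataset.

    Assumes (not stated in the source) that the real-time factor is
    processing time / audio duration, and that work scales linearly
    across accelerators with no overhead.
    """
    # Total compute time on a single accelerator, in hours.
    compute_hours = dataset_hours * rtf
    # Wall-clock time when the work is split evenly across accelerators.
    wall_clock_hours = compute_hours / num_accelerators
    return wall_clock_hours / 24.0


# Figures taken from the abstract: one million hours, RTF 0.0078, 100 accelerators.
days = processing_days(1_000_000, 0.0078, 100)
print(f"{days:.2f} days")  # prints "3.25 days"
```

Under these assumptions the estimate is 78 wall-clock hours, consistent with the "approximately three days" stated in the abstract.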

Abstract (original)

Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.

arXiv information

Authors	Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Published	2025-05-07 14:27:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
