Investigating the Emergent Audio Classification Ability of ASR Foundation Models

Summary

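Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that lets them be applied in general and low-resource settings.
Far less work has examined the zero-shot abilities of ASR foundation models, which are typically fine-tuned for specific tasks or restricted to applications that match their training criterion and data annotation.
This work investigates whether Whisper and MMS, ASR foundation models trained primarily for speech recognition, can perform zero-shot audio classification.
Simple template-based text prompts are supplied to the decoder, and the resulting decoding probabilities are used to produce zero-shot predictions.
Without training the model on extra data or adding any new parameters, Whisper shows promising zero-shot classification performance across 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%.
One important step in unlocking this emergent ability is debiasing: a simple unsupervised reweighting of the class probabilities yields consistent, significant performance gains.
Performance also increases with model size, which suggests that zero-shot performance may improve further as ASR foundation models scale up.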
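
The prompting and debiasing steps described in the summary can be illustrated with a small sketch. It assumes that each audio clip has already been scored by the ASR decoder, giving one log-likelihood per class for a filled-in prompt template (for example, a hypothetical template such as "This is a sound of {label}."). The abstract does not spell out the reweighting method, so the version below is only one plausible choice: per-class weights are fitted on unlabelled clips so that the average predicted class distribution becomes uniform. The function names and the toy scores are illustrative assumptions, not the authors' code.

```python
import math


def class_probs(log_likelihoods, log_weights=None):
    """Softmax over per-class prompt log-likelihoods, optionally reweighted."""
    if log_weights is None:
        log_weights = [0.0] * len(log_likelihoods)
    scores = [ll + w for ll, w in zip(log_likelihoods, log_weights)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(log_likelihoods, log_weights=None):
    """Index of the most probable class."""
    probs = class_probs(log_likelihoods, log_weights)
    return max(range(len(probs)), key=probs.__getitem__)


def fit_log_weights(batch, n_iters=100):
    """Fit per-class log-weights on unlabelled clips (no labels are used).

    Assumption: the true class prior is uniform, so the weights are adjusted
    until the mean predicted distribution over the batch is uniform.
    """
    n_classes = len(batch[0])
    log_target = math.log(1.0 / n_classes)
    log_w = [0.0] * n_classes
    for _ in range(n_iters):
        mean = [0.0] * n_classes
        for ll in batch:
            for k, p in enumerate(class_probs(ll, log_w)):
                mean[k] += p / len(batch)
        # Raise the weight of under-predicted classes, lower over-predicted ones.
        log_w = [w + log_target - math.log(m) for w, m in zip(log_w, mean)]
        # Only weight differences matter, so pin the first weight to zero.
        log_w = [w - log_w[0] for w in log_w]
    return log_w


# Toy decoder scores for 4 clips and 2 classes, biased towards class 0.
batch = [[-1.0, -3.0], [-1.5, -2.0], [-2.0, -1.8], [-1.0, -4.0]]
print([predict(ll) for ll in batch])           # [0, 0, 1, 0]
weights = fit_log_weights(batch)
print([predict(ll, weights) for ll in batch])  # [0, 1, 1, 0]
```

In this toy case the raw scores favour class 0 for three of the four clips; after reweighting, the predictions are balanced across the two classes. Whether a uniform prior is appropriate depends on the dataset, so this should be read as a sketch under that assumption.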

Summary (original)

Text and vision foundation models can perform many tasks in a zero-shot setting, a desirable property that enables these systems to be applied in general and low-resource settings. There has been far less work, however, on the zero-shot abilities of ASR foundation models, with these systems typically fine-tuned to specific tasks or constrained to applications that match their training criterion and data annotation. In this work we investigate the ability of Whisper and MMS, ASR foundation models trained primarily for speech recognition, to perform zero-shot audio classification. We use simple template-based text prompts at the decoder and use the resulting decoding probabilities to generate zero-shot predictions. Without training the model on extra data or adding any new parameters, we demonstrate that Whisper shows promising zero-shot classification performance on a range of 8 audio-classification datasets, outperforming the accuracy of existing state-of-the-art zero-shot baselines by an average of 9%. One important step to unlock the emergent ability is debiasing, where a simple unsupervised reweighting method of the class probabilities yields consistent significant performance gains. We further show that performance increases with model size, implying that as ASR foundation models scale up, they may exhibit improved zero-shot performance.

arXiv information

Authors	Rao Ma, Adian Liusie, Mark J. F. Gales, Kate M. Knill
Published	2024-03-28 16:31:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
