Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

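Abstract

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM).
This paper proposes using recently introduced auditory large language models (LLMs) for automatic speech quality assessment.
By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM, and A/B testing results, which are commonly used to evaluate text-to-speech systems.
In addition, the finetuned auditory LLM can generate natural language descriptions that assess aspects such as noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs.
Extensive experiments were performed on the NISQA, BVCC, SOMOS, and VoxSim speech quality datasets using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio.
For the natural language description task, the commercial model Google Gemini 1.5 Pro is also evaluated.
The results show that auditory LLMs achieve competitive performance compared with state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions.
The data processing scripts and finetuned model checkpoints are available at https://github.com/bytedance/SALMONN.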

Abstract (original)

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, a commercial model Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints can be found at https://github.com/bytedance/SALMONN.
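
The abstract says that one auditory LLM is finetuned on several tasks by switching task-specific prompts. The following is only a minimal sketch of how such finetuning examples might be assembled. The prompt wording, the task keys, and the helper names `build_example` and `format_score` are all hypothetical and are not taken from the paper or the SALMONN repository.

```python
# Hypothetical prompt templates: the wording is illustrative only and is
# NOT the prompt text used by the authors.
PROMPT_TEMPLATES = {
    "mos": "Give a mean opinion score for the overall quality of this speech clip.",
    "sim": "Give a similarity score for the speakers in these two speech clips.",
    "ab": "Which of these two speech clips has better quality? Answer A or B.",
    "describe": (
        "Describe the noisiness, distortion, discontinuity and overall "
        "quality of this speech clip."
    ),
}

# Assumption: SIM and A/B tasks compare two clips, the other tasks take one.
CLIPS_PER_TASK = {"mos": 1, "sim": 2, "ab": 2, "describe": 1}


def format_score(score: float) -> str:
    """Render a numeric label as the text target the model would be trained to emit."""
    return f"{score:.1f}"


def build_example(task: str, audio_paths: list[str], target: str) -> dict:
    """Pair the audio input(s) with the task prompt and a text target."""
    if task not in PROMPT_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    if len(audio_paths) != CLIPS_PER_TASK[task]:
        raise ValueError(
            f"task {task!r} expects {CLIPS_PER_TASK[task]} clip(s), "
            f"got {len(audio_paths)}"
        )
    return {
        "audio": list(audio_paths),
        "prompt": PROMPT_TEMPLATES[task],
        "target": target,
    }


if __name__ == "__main__":
    # File names below are placeholders.
    print(build_example("mos", ["clip.wav"], format_score(4.0)))
    print(build_example("ab", ["a.wav", "b.wav"], "A"))
```

Casting every task as text generation is what lets a single model cover numeric scores, pairwise preferences, and free-form descriptions; only the prompt and the target string change between tasks.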

arXiv information

Authors	Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang
Published	2025-04-01 12:35:25+00:00

Source and services used

arxiv.jp, Google
