Enabling Auditory Large Language Models for Automatic Speech Quality Evaluation

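Abstract

Speech quality assessment usually requires evaluating speech from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which is difficult to cover with one small model designed for a single task. This paper proposes automatic speech quality assessment that leverages recently introduced auditory large language models (LLMs). By adopting task-specific prompts, auditory LLMs are fine-tuned to predict MOS, SIM, and A/B test results, which are commonly used to evaluate speech synthesis systems. In addition, the fine-tuned auditory LLM can generate natural language descriptions that assess aspects such as noise, distortion, discontinuity, and overall quality, providing more interpretable output. Extensive experiments were conducted on the NISQA, BVCC, SOMOS, and VoxSim speech quality datasets using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language description task, the commercial model Google Gemini 1.5 Pro was also evaluated. The results demonstrate that auditory LLMs achieve competitive performance in MOS and SIM prediction compared with state-of-the-art task-specific small models, and also deliver promising results in A/B testing and natural language description. Our data processing scripts and fine-tuned model checkpoints are available at https://github.com/bytedance/SALMONN.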

Abstract (original)

Speech quality assessment typically requires evaluating audio from multiple aspects, such as mean opinion score (MOS) and speaker similarity (SIM), which can be challenging to cover using one small model designed for a single task. In this paper, we propose leveraging recently introduced auditory large language models (LLMs) for automatic speech quality assessment. By employing task-specific prompts, auditory LLMs are finetuned to predict MOS, SIM and A/B testing results, which are commonly used for evaluating text-to-speech systems. Additionally, the finetuned auditory LLM is able to generate natural language descriptions assessing aspects like noisiness, distortion, discontinuity, and overall quality, providing more interpretable outputs. Extensive experiments have been performed on the NISQA, BVCC, SOMOS and VoxSim speech quality datasets, using open-source auditory LLMs such as SALMONN, Qwen-Audio, and Qwen2-Audio. For the natural language descriptions task, the commercial model Google Gemini 1.5 Pro is also evaluated. The results demonstrate that auditory LLMs achieve competitive performance compared to state-of-the-art task-specific small models in predicting MOS and SIM, while also delivering promising results in A/B testing and natural language descriptions. Our data processing scripts and finetuned model checkpoints can be found at https://github.com/bytedance/SALMONN.
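The abstract describes using task-specific prompts so that a single auditory LLM can return MOS, SIM, or A/B judgments as text. A minimal sketch of that idea follows, assuming hypothetical prompt wording and hypothetical helper names (`TASK_PROMPTS`, `build_prompt`, `parse_score`, `parse_ab`); none of these are taken from the paper or from the SALMONN repository, and the model call itself is left out.

```python
import re
from typing import Optional

# Hypothetical prompt wording for illustration only; the abstract does not
# give the prompts actually used for finetuning.
TASK_PROMPTS = {
    "mos": "Rate the overall quality of this speech clip from 1 to 5.",
    "sim": "Rate how similar the speakers in the two clips are.",
    "ab": "Which of the two clips has better quality? Answer A or B.",
}


def build_prompt(task: str) -> str:
    """Return the text prompt for one evaluation task."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"unknown task: {task!r}")
    return TASK_PROMPTS[task]


def parse_score(reply: str, low: float = 1.0, high: float = 5.0) -> Optional[float]:
    """Extract the first number in the reply; None if absent or out of range."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def parse_ab(reply: str) -> Optional[str]:
    """Accept only a bare 'A' or 'B' answer (optional whitespace or periods)."""
    match = re.fullmatch(r"\s*([AB])[\s.]*", reply)
    return match.group(1) if match else None
```

Parsing the reply strictly and returning `None` on anything unexpected keeps malformed generations from being silently counted as scores; the score range is a parameter because rating scales differ between datasets.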

arXiv information

Authors	Siyin Wang, Wenyi Yu, Yudong Yang, Changli Tang, Yixuan Li, Jimin Zhuang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Chao Zhang
Published	2025-03-03 07:22:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
