Audio Turing Test: Benchmarking the Human-likeness of Large Language Model-based Text-to-Speech Systems in Chinese

Summary

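Recent advances in large language models (LLMs) have substantially improved text-to-speech (TTS) systems, giving finer control over speech style, naturalness, and emotional expression and bringing TTS systems close to human-level performance.
The Mean Opinion Score (MOS) remains the standard for evaluating TTS systems, but it suffers from subjectivity, inconsistent listening environments, and limited interpretability.
Existing evaluation datasets also lack a multi-dimensional design and often ignore factors such as speaking style, context diversity, and trap utterances. This gap is especially pronounced in Chinese TTS evaluation.
To address these challenges, the authors introduce the Audio Turing Test (ATT), which pairs ATT-Corpus, a multi-dimensional Chinese corpus dataset, with a simple evaluation protocol inspired by the Turing test.
Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human.
This simplification reduces rating bias and makes the evaluation more robust.
To further support rapid model development, the authors also fine-tune Qwen2-Audio-Instruct on human judgment data to obtain Auto-ATT, a model for automatic evaluation.
Experimental results show that ATT, through its multi-dimensional design, effectively differentiates models across specific capability dimensions.
Auto-ATT also aligns strongly with human evaluations, confirming its value as a fast and reliable assessment tool.
The white-box ATT-Corpus and Auto-ATT are available in the ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).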

Summary (original)

Recent advances in large language models (LLMs) have significantly improved text-to-speech (TTS) systems, enhancing control over speech style, naturalness, and emotional expression, which brings TTS Systems closer to human-level performance. Although the Mean Opinion Score (MOS) remains the standard for TTS System evaluation, it suffers from subjectivity, environmental inconsistencies, and limited interpretability. Existing evaluation datasets also lack a multi-dimensional design, often neglecting factors such as speaking styles, context diversity, and trap utterances, which is particularly evident in Chinese TTS evaluation. To address these challenges, we introduce the Audio Turing Test (ATT), a multi-dimensional Chinese corpus dataset ATT-Corpus paired with a simple, Turing-Test-inspired evaluation protocol. Instead of relying on complex MOS scales or direct model comparisons, ATT asks evaluators to judge whether a voice sounds human. This simplification reduces rating bias and improves evaluation robustness. To further support rapid model development, we also finetune Qwen2-Audio-Instruct with human judgment data as Auto-ATT for automatic evaluation. Experimental results show that ATT effectively differentiates models across specific capability dimensions using its multi-dimensional design. Auto-ATT also demonstrates strong alignment with human evaluations, confirming its value as a fast and reliable assessment tool. The white-box ATT-Corpus and Auto-ATT can be found in ATT Hugging Face Collection (https://huggingface.co/collections/meituan/audio-turing-test-682446320368164faeaf38a4).
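
The abstract describes the ATT protocol only at a high level: evaluators judge whether a voice sounds human, and results are compared across capability dimensions. As a minimal sketch of how such judgments could be aggregated, the Python below computes the fraction of "sounds human" votes per system and dimension. The binary vote format, the function name `human_likeness_rates`, the system names, and the dimension labels are all assumptions made for illustration; the paper's actual scoring scheme may differ.

```python
from collections import defaultdict


def human_likeness_rates(judgments):
    """Return the fraction of 'sounds human' votes per (system, dimension).

    judgments: iterable of (system, dimension, judged_human) tuples, where
    judged_human is True when the evaluator said the clip sounds human.
    This binary format is an assumption, not the paper's documented schema.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [human_votes, total_votes]
    for system, dimension, judged_human in judgments:
        entry = counts[(system, dimension)]
        if judged_human:
            entry[0] += 1
        entry[1] += 1
    return {key: human / total for key, (human, total) in counts.items()}


# Hypothetical judgments; system and dimension names are invented.
sample = [
    ("tts_a", "dialogue", True),
    ("tts_a", "dialogue", False),
    ("tts_a", "trap", False),
    ("tts_b", "dialogue", True),
    ("tts_b", "dialogue", True),
    ("tts_b", "trap", True),
]

rates = human_likeness_rates(sample)
for (system, dimension), rate in sorted(rates.items()):
    print(f"{system:6s} {dimension:9s} {rate:.2f}")
```

Keeping the rates keyed by dimension rather than pooling all votes is what lets two systems with similar overall scores be told apart on a specific dimension, which is the kind of differentiation the abstract attributes to the multi-dimensional design.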

arXiv information

Authors	Xihuai Wang, Ziyi Zhao, Siyu Ren, Shao Zhang, Song Li, Xiaoyu Li, Ziwen Wang, Lin Qiu, Guanglu Wan, Xuezhi Cao, Xunliang Cai, Weinan Zhang
Published	2025-05-16 12:57:23+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
