VocalBench: Benchmarking the Vocal Conversational Abilities for Speech Interaction Models

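Summary

The rapid progress of large language models (LLMs) has accelerated the development of multimodal models capable of vocal communication.
Unlike text-based interaction, speech carries rich and varied information, including semantic content, acoustic variation, paralinguistic cues, and environmental context.
However, existing evaluations of speech interaction models focus mainly on the quality of the text response; they often overlook important aspects of vocal performance and lack benchmarks with vocal-specific test instances.
To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate the capabilities of speech interaction models in vocal communication.
VocalBench consists of 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational ability, and robustness.
It covers 16 fundamental skills essential for effective vocal interaction.
Experimental results reveal substantial variability in the capabilities of current models, each showing distinct strengths and weaknesses, and offer valuable insights to guide future research on speech-based interaction systems.
Code and evaluation instances are available at https://github.com/SJTU-OmniAgent/VocalBench.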

Abstract (original)

The rapid advancement of large language models (LLMs) has accelerated the development of multi-modal models capable of vocal communication. Unlike text-based interactions, speech conveys rich and diverse information, including semantic content, acoustic variations, paralanguage cues, and environmental context. However, existing evaluations of speech interaction models predominantly focus on the quality of their textual responses, often overlooking critical aspects of vocal performance and lacking benchmarks with vocal-specific test instances. To address this gap, we propose VocalBench, a comprehensive benchmark designed to evaluate speech interaction models’ capabilities in vocal communication. VocalBench comprises 9,400 carefully curated instances across four key dimensions: semantic quality, acoustic performance, conversational abilities, and robustness. It covers 16 fundamental skills essential for effective vocal interaction. Experimental results reveal significant variability in current model capabilities, each exhibiting distinct strengths and weaknesses, and provide valuable insights to guide future research in speech-based interaction systems. Code and evaluation instances are available at https://github.com/SJTU-OmniAgent/VocalBench.

arXiv information

Authors	Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Published	2025-05-21 16:34:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
