Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Summary

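A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive way. Rather than merely reacting to commands, it would listen continuously, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interaction. We introduce Voila, a family of large voice-language foundation models that takes a step toward this vision. Voila goes beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversation while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, faster than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation. Voila also supports more than one million pre-built voices and can efficiently customize new voices from audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward the next generation of human-machine interaction.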

Abstract (original)

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that make a step towards this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation — where users can simply write text instructions to define the speaker’s identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from brief audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), Text-to-Speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

arXiv information

Authors	Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu
Published	2025-05-05 15:05:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
