EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

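Summary

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models.
However, enabling large language models to perceive and generate images, text, and speech end-to-end using only publicly available data remains challenging for the open-source community.
Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no visual understanding capabilities.
To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips large language models with end-to-end speech capabilities while maintaining leading vision-language performance.
With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance both vision-language and speech abilities compared with the corresponding bi-modally aligned counterparts.
In addition, a lightweight style module is introduced for flexible control of speech styles such as emotion and pitch.
EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.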

Abstract (original)

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or totally without vision-understanding capabilities. To address this gap, we propose the EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech abilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for the flexible speech style controls including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.

arXiv information

Authors	Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Jun Yao, Lanqing Hong, Lu Hou, Hang Xu
Published	2025-03-13 14:51:04+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
