EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Summary

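GPT-4o, an omni-modal model that enables spoken conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models.
However, enabling large language models to perceive and generate images, text, and speech end-to-end using publicly available data remains challenging for the open-source community.
Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no visual understanding.
To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips large language models with end-to-end speech capabilities while maintaining leading vision-language performance.
With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment further improves both vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts.
In addition, a lightweight style module is proposed for flexible control of speech style (e.g., emotion and pitch).
For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.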

Summary (original)

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, texts, and speeches end-to-end with publicly available data remains challenging in the open-source community. Existing vision-language models rely on external tools for the speech processing, while speech-language models still suffer from limited or even without vision-understanding abilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), to enable Large Language Models with end-to-end speech capabilities while maintaining the leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we notice surprisingly that omni-modal alignment can further enhance vision-language and speech abilities compared with the corresponding bi-modal aligned counterparts. Moreover, a lightweight style module is proposed for flexible speech style controls (e.g., emotions and pitches). For the first time, EMOVA achieves state-of-the-art performance on both the vision-language and speech benchmarks, and meanwhile, supporting omni-modal spoken dialogue with vivid emotions.

arXiv information

Authors	Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, Dingdong Wang, Kun Xiang, Haoyuan Li, Haoli Bai, Jianhua Han, Xiaohui Li, Weike Jin, Nian Xie, Yu Zhang, James T. Kwok, Hengshuang Zhao, Xiaodan Liang, Dit-Yan Yeung, Xiao Chen, Zhenguo Li, Wei Zhang, Qun Liu, Lanqing Hong, Lu Hou, Hang Xu
Published	2024-09-26 16:44:02+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
