VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

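Summary

Recent multimodal large language models (MLLMs) have mostly focused on integrating the visual and textual modalities, with comparatively little attention paid to the role of speech in enriching interaction.
However, speech plays an important role in multimodal dialogue systems, and achieving high performance on both vision and speech tasks remains a major challenge because of fundamental differences between the modalities.
This paper proposes a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction.
The approach not only preserves strong vision-language capability but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly speeding up end-to-end multimodal responses.
By comparing the method with state-of-the-art counterparts on image, video, and speech benchmarks, the authors show that the model has strong visual and speech capabilities and achieves near real-time vision and speech interaction.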

Abstract (original)

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and implementing high-performance in both vision and speech tasks remains a significant challenge due to the fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity, but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with both strong visual and speech capabilities, making near real-time vision and speech interaction.

arXiv information

Authors	Chaoyou Fu, Haojia Lin, Xiong Wang, Yi-Fan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, Ran He
Published	2025-01-21 15:36:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
