Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

Summary

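This paper proposes a new task, generating speech from videos of people and their transcripts (VTTS), to motivate new techniques for multimodal speech generation.
The task generalizes speech generation from cropped lip videos, and it is more complex than generating generic audio clips (for example, a dog barking) from video and text.
Multilingual versions of the task could lead to new techniques for cross-lingual dubbing.
The paper also presents a decoder-only multimodal model for this task, called Visatronic.
The model embeds vision, text, and speech directly into the common subspace of a transformer and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on videos of the speaker and transcripts of their speech.
By embedding all modalities into a common subspace, Visatronic can achieve better results than models that take only text or only video as input.
It is also a much simpler approach to multimodal speech generation than prevailing approaches, which rely on lip detectors and complicated architectures to fuse modalities, and it produces better results.
Because the model is flexible enough to accommodate different ways of ordering the inputs as a sequence, the authors carefully explore different ordering strategies to better understand how best to propagate information to the generation steps.
To facilitate further research on VTTS, the authors will release (i) their code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS that incorporates both objective and subjective metrics.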
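
The ingredients named in the summary (discretized mel-spectrograms, a single input sequence whose modality order can vary, and an autoregressive loss) can be illustrated with a toy sketch. This is not the authors' implementation: the uniform binning scheme, the bin count, the value range, and the function names discretize, build_sequence, and autoregressive_nll are all hypothetical, chosen only to make the ideas concrete.

```python
import math

def discretize(mel_frame, n_bins=8, lo=-4.0, hi=4.0):
    """Uniformly quantize each mel value into an integer bin in [0, n_bins - 1].

    Assumption: uniform bins over a fixed range; the paper's scheme may differ.
    """
    step = (hi - lo) / n_bins
    tokens = []
    for v in mel_frame:
        clipped = min(max(v, lo), hi)
        tokens.append(min(int((clipped - lo) / step), n_bins - 1))
    return tokens

def build_sequence(video, text, speech, order=("video", "text", "speech")):
    """Concatenate per-modality items in the given order.

    Returns the sequence and a parallel list of flags marking speech positions,
    which are the only positions that would contribute to the loss.
    """
    parts = {"video": video, "text": text, "speech": speech}
    seq, is_speech = [], []
    for name in order:
        seq.extend(parts[name])
        is_speech.extend([name == "speech"] * len(parts[name]))
    return seq, is_speech

def autoregressive_nll(logits_per_step, targets):
    """Mean negative log-likelihood of the target tokens under softmax(logits)."""
    total = 0.0
    for logits, t in zip(logits_per_step, targets):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[t]
    return total / len(targets)

# Toy usage with placeholder items standing in for embedded inputs.
tokens = discretize([-5.0, 0.0, 3.9])
seq, is_speech = build_sequence(["v0", "v1"], ["t0"], ["s0", "s1"])
loss = autoregressive_nll([[0.0, 0.0]], [0])
```

Changing the order argument (for example, placing text before video) is one simple way to experiment with the input-ordering strategies the summary mentions; in a real model the placeholder items would be learned embeddings and the logits would come from the transformer.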

Abstract (original)

In this paper, we propose a new task — generating speech from videos of people and their transcripts (VTTS) — to motivate new techniques for multimodal speech generation. This task generalizes the task of generating speech from cropped lip videos, and is also more complicated than the task of generating generic audio clips (e.g., dog barking) from videos and text. Multilingual versions of the task could lead to new techniques for cross-lingual dubbing. We also present a decoder-only multimodal model for this task, which we call Visatronic. This model embeds vision, text and speech directly into the common subspace of a transformer model and uses an autoregressive loss to learn a generative model of discretized mel-spectrograms conditioned on speaker videos and transcripts of their speech. By embedding all modalities into a common subspace, Visatronic can achieve improved results over models that use only text or video as input. Further, it presents a much simpler approach for multimodal speech generation compared to prevailing approaches which rely on lip-detectors and complicated architectures to fuse modalities while producing better results. Since the model is flexible enough to accommodate different ways of ordering inputs as a sequence, we carefully explore different strategies to better understand the best way to propagate information to the generative steps. To facilitate further research on VTTS, we will release (i) our code, (ii) clean transcriptions for the large-scale VoxCeleb2 dataset, and (iii) a standardized evaluation protocol for VTTS incorporating both objective and subjective metrics.

arXiv information

Authors	Akshita Gupta, Tatiana Likhomanenko, Karren Dai Yang, Richard He Bai, Zakaria Aldeneh, Navdeep Jaitly
Published	2024-11-26 18:57:29+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
