MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

Abstract

Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information. To address these issues, we introduce a novel multi-modal latent diffusion model (MM-LDM) for the SVG task. We first unify the representation of audio and video data by converting them into a single or a couple of images. Then, we introduce a hierarchical multi-modal autoencoder that constructs a low-level perceptual latent space for each modality and a shared high-level semantic feature space. The former space is perceptually equivalent to the raw signal space of each modality but drastically reduces signal dimensions. The latter space serves to bridge the information gap between modalities and provides more insightful cross-modal guidance. Our proposed method achieves new state-of-the-art results with significant quality and efficiency gains. Specifically, our method achieves a comprehensive improvement on all evaluation metrics and a faster training and sampling speed on Landscape and AIST++ datasets. Moreover, we explore its performance on open-domain sounding video generation, long sounding video generation, audio continuation, video continuation, and conditional single-modal generation tasks for a comprehensive evaluation, where our MM-LDM demonstrates exciting adaptability and generalization ability.
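
The abstract says audio and video are first unified by converting them into images. As a minimal sketch of what such a conversion could look like for audio, the snippet below turns a waveform into a 2-D log-magnitude spectrogram with pixel values in 0..255. This is an assumption for illustration only: the abstract does not specify the conversion, and the function name `audio_to_image` and the parameters `frame_len` and `hop` are hypothetical.

```python
import cmath
import math


def audio_to_image(samples, frame_len=64, hop=32):
    """Turn a 1-D waveform into a 2-D image given as a list of pixel rows.

    Assumption: a log-magnitude spectrogram stands in for whatever
    audio-to-image conversion the paper actually uses.
    Rows are frequency bins, columns are time frames.
    """
    # Hann window reduces spectral leakage at frame boundaries.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1
    columns = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        # Naive DFT, kept simple for readability (O(N^2) per frame).
        mags = []
        for k in range(n_bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            mags.append(math.log1p(abs(acc)))
        columns.append(mags)
    if not columns:
        raise ValueError("waveform shorter than one frame")
    # Fall back to 1.0 for an all-zero signal to avoid dividing by zero.
    peak = max(max(col) for col in columns) or 1.0
    # Transpose to (frequency, time) and scale to integer pixels in 0..255.
    return [[round(255 * col[k] / peak) for col in columns]
            for k in range(n_bins)]


# Usage: a 440 Hz tone sampled at 8 kHz (values chosen for illustration).
tone = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(1024)]
image = audio_to_image(tone)
```

With these illustrative settings the result has one row per frequency bin and one column per time frame, so an image model could in principle treat it like a single-channel picture alongside video frames.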

arXiv information

Authors	Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu
Published	2024-10-02 14:32:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
