CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling

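Summary

This paper introduces a multi-modal diffusion model designed for bi-directional conditional generation of video and audio.
A joint contrastive training loss is proposed to improve synchronization between visual and auditory events.
Experiments on two datasets evaluate the effectiveness of the proposed model.
Generation quality and alignment performance are assessed from several angles, using both objective and subjective metrics.
The results show that, with the introduction of a new cross-modal easy fusion architectural block, the proposed model outperforms the baseline in quality and generation speed.
In addition, incorporating the contrastive loss improves audio-visual alignment, particularly in the highly correlated video-to-audio generation task.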

Abstract (original)

We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences. We present experiments on two datasets to evaluate the efficacy of our proposed model. The assessment of generation quality and alignment performance is carried out from various angles, encompassing both objective and subjective metrics. Our findings demonstrate that the proposed model outperforms the baseline in terms of quality and generation speed through introduction of our novel cross-modal easy fusion architectural block. Furthermore, the incorporation of the contrastive loss results in improvements in audio-visual alignment, particularly in the high-correlation video-to-audio generation task.
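
The abstract does not spell out the form of the joint contrastive loss, so the following is only a generic InfoNCE-style sketch of the idea that matched video and audio embeddings should score higher than mismatched ones. The function names, the cosine similarity, the temperature value, and the toy embeddings are all assumptions made for illustration; none of them are taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce_loss(video_embs, audio_embs, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    The i-th video embedding and the i-th audio embedding form the positive
    pair; every other audio embedding in the batch acts as a negative.
    """
    n = len(video_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(video_embs[i], audio_embs[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the matching index i as the target.
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings (assumed values): each video clip matches the audio clip
# at the same index.
aligned_video = [[1.0, 0.0], [0.0, 1.0]]
aligned_audio = [[0.9, 0.1], [0.1, 0.9]]
# Swapping the audio clips breaks the pairing.
shuffled_audio = [aligned_audio[1], aligned_audio[0]]

loss_aligned = info_nce_loss(aligned_video, aligned_audio)
loss_shuffled = info_nce_loss(aligned_video, shuffled_audio)
print(loss_aligned < loss_shuffled)
```

In this sketch the loss is small when each video embedding is closest to its own audio embedding and grows when the pairing is shuffled, which is the behavior a contrastive alignment term is meant to encourage.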

arXiv information

Authors	Ruihan Yang, Hannes Gamper, Sebastian Braun
Published	2024-10-09 16:49:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
