Multimodal ELBO with Diffusion Decoders

Summary

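Multimodal variational autoencoders (VAEs) have demonstrated that they can learn the relationships between different modalities by mapping them into a latent representation. However, multimodal VAEs often produce low-quality output, particularly when complex modalities such as images are involved. In addition, coherence among the generated modalities is often low when sampling from the joint distribution. To address these limitations, the authors propose a new variant of the multimodal VAE ELBO that incorporates a stronger decoder based on a diffusion generative model. The diffusion decoder lets the model learn complex modalities and generate high-quality outputs. The multimodal model also integrates seamlessly with standard feed-forward decoders for other types of modality, which facilitates end-to-end training and inference. Furthermore, an auxiliary score-based model is introduced to strengthen the unconditional generation capabilities of the proposed approach. The approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities for improving multimodal generation tasks. Compared with other multimodal VAEs on different datasets, the model delivers state-of-the-art results, with higher coherence and superior quality in the generated modalities.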

Summary (original)

Multimodal variational autoencoders have demonstrated their ability to learn the relationships between different modalities by mapping them into a latent representation. Their design and capacity to perform any-to-any conditional and unconditional generation make them appealing. However, different variants of multimodal VAEs often suffer from generating low-quality output, particularly when complex modalities such as images are involved. In addition to that, they frequently exhibit low coherence among the generated modalities when sampling from the joint distribution. To address these limitations, we propose a new variant of the multimodal VAE ELBO that incorporates a better decoder using a diffusion generative model. The diffusion decoder enables the model to learn complex modalities and generate high-quality outputs. The multimodal model can also seamlessly integrate with a standard feed-forward decoder for different types of modality, facilitating end-to-end training and inference. Furthermore, we introduce an auxiliary score-based model to enhance the unconditional generation capabilities of our proposed approach. This approach addresses the limitations imposed by conventional multimodal VAEs and opens up new possibilities to improve multimodal generation tasks. Our model provides state-of-the-art results compared to other multimodal VAEs in different datasets with higher coherence and superior quality in the generated modalities.
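
The abstract gives no formulas, so the following is only a sketch of what an ELBO with a diffusion decoder could look like, written with standard notation rather than the authors' own. The first equation is the usual multimodal ELBO. The second shows one common way to swap a reconstruction term for a latent-conditioned denoising loss in the style of DDPM. All symbols (x_{1:M}, z, q_\phi, p_\theta, \epsilon_\theta, \bar{\alpha}_t) are assumptions introduced for illustration, and the paper's actual objective, weighting, and auxiliary score-based prior model may differ.

```latex
% Assumed notation (not from the source): x_{1:M} are the M modalities,
% z is the shared latent, q_\phi is the joint encoder, p_\theta are the decoders.
\begin{equation}
\mathcal{L}(x_{1:M})
  = \mathbb{E}_{q_\phi(z \mid x_{1:M})}\Big[\sum_{m=1}^{M} \log p_\theta(x_m \mid z)\Big]
  - \mathrm{KL}\big(q_\phi(z \mid x_{1:M}) \,\|\, p(z)\big)
\end{equation}

% Assumption: for a complex modality x_m (e.g. images), the term
% -\log p_\theta(x_m \mid z) is replaced by a denoising surrogate conditioned on z.
\begin{equation}
-\log p_\theta(x_m \mid z)
  \;\longrightarrow\;
  \mathbb{E}_{t,\;\epsilon \sim \mathcal{N}(0, I)}
  \Big[\big\|\epsilon - \epsilon_\theta\big(x_m^{(t)}, t, z\big)\big\|^2\Big],
\qquad
x_m^{(t)} = \sqrt{\bar{\alpha}_t}\, x_m + \sqrt{1 - \bar{\alpha}_t}\,\epsilon
\end{equation}
```

Under this reading, modalities that keep a feed-forward decoder retain their ordinary log-likelihood term in the sum, which is consistent with the abstract's statement that both decoder types can be trained end to end.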

arXiv information

Authors	Daniel Wesego, Pedram Rooshenas
Published	2025-02-03 05:27:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
