Discrete Contrastive Diffusion for Cross-Modal Music and Image Generation

Summary

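Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, thanks to their promising results and their support for cross-modal synthesis.
A key requirement in conditional synthesis is high correspondence between the conditioning input and the generated output.
Most existing methods learn this relationship implicitly by incorporating the prior into the variational lower bound.
In this work, we take a different route: we explicitly strengthen input-output connections by maximizing their mutual information.
To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms that incorporate it effectively into the denoising process, combining diffusion training and contrastive learning for the first time by connecting it with the conventional variational objectives.
We demonstrate the efficacy of the approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis.
On each task, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality.
Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and significantly increasing inference speed.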

Summary (original)

Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route — we explicitly enhance input-output connections by maximizing their mutual information. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining the diffusion training and contrastive learning for the first time by connecting it with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations with diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, as well as class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks, significantly increasing the inference speed.
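The abstract's idea of strengthening input-output correspondence by maximizing mutual information with a contrastive objective can be illustrated with a generic InfoNCE-style loss. This is only a minimal sketch under stated assumptions: it is not the CDCD loss itself, whose exact form the abstract does not give, and the function names `info_nce_loss` and `mutual_information_lower_bound` as well as the toy similarity scores are hypothetical.

```python
import math


def info_nce_loss(scores, positive_index=0):
    """Generic InfoNCE loss for a single conditioning input (illustrative only).

    scores: similarity scores between the conditioning input and each
            candidate output; one candidate is the matching (positive)
            sample and the rest are negatives.
    """
    # Numerically stable log-sum-exp over all candidates.
    max_score = max(scores)
    log_sum_exp = max_score + math.log(
        sum(math.exp(s - max_score) for s in scores)
    )
    # Negative log-probability assigned to the positive candidate.
    return log_sum_exp - scores[positive_index]


def mutual_information_lower_bound(loss, num_candidates):
    # Standard InfoNCE bound: I(input; output) >= log(N) - loss.
    return math.log(num_candidates) - loss


# Toy scores (assumed): the positive candidate is at index 0.
matched = info_nce_loss([5.0, 0.0, 0.0, 0.0])    # positive clearly preferred
unmatched = info_nce_loss([0.0, 0.0, 0.0, 0.0])  # no preference at all
```

In this sketch, a lower loss corresponds to a tighter lower bound on the mutual information between input and output: when the scores carry no preference for the positive candidate, the loss equals log(N) and the bound is zero.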

arXiv information

Authors	Ye Zhu, Yu Wu, Kyle Olszewski, Jian Ren, Sergey Tulyakov, Yan Yan
Publication date	2023-02-16 18:00:31+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
