Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer

Summary

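Deep generative models have attracted considerable attention, but most existing work is designed for unimodal generation.
This paper explores a new method for unconditional image-text pair generation.
We design the Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, and find that a joint image-text representation space is effective for generating semantically consistent image-text pairs.
To learn multimodal semantic correlations in the quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy.
Specifically, MXQ-VAE takes a masked image-text pair as input and learns a quantized joint representation space so that the input can be converted into a unified code sequence; this code sequence is then used for unconditional image-text pair generation.
Extensive experiments on synthetic and real-world datasets show a correlation between the quantized joint space and multimodal generation capability.
In addition, we demonstrate that our approach outperforms several baselines in both respects.
The source code is publicly available at https://github.com/ttumyche/MXQ-VAE.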

Abstract (original)

Although deep generative models have gained a lot of attention, most of the existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence, then we perform unconditional image-text pair generation with the code sequence. Extensive experiments show the correlation between the quantized joint space and the multimodal generation capability on synthetic and real-world datasets. In addition, we demonstrate the superiority of our approach in these two aspects over several baselines. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.
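
The quantization step the abstract describes can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the toy two-dimensional features, the single shared codebook, and the zero-vector mask value are all assumptions made for illustration, and the Transformer encoder and the decoders are omitted entirely. It shows only how a masked image-text feature sequence could be mapped to one unified sequence of code indices by nearest-neighbour lookup, as in a VQ-VAE-style quantizer.

```python
import random

def mask_features(features, mask_ratio, mask_value, rng):
    """Replace a random subset of positions with mask_value (hypothetical masking rule)."""
    n_mask = int(len(features) * mask_ratio)
    masked_positions = set(rng.sample(range(len(features)), n_mask))
    return [mask_value if i in masked_positions else f
            for i, f in enumerate(features)]

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to vector (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    return [nearest_code(f, codebook) for f in features]

# Toy data: all values below are made up for illustration.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]   # assumed shared codebook
image_feats = [[0.9, 1.1], [-0.8, 0.9]]            # stand-in for image features
text_feats = [[0.1, -0.1]]                         # stand-in for text features

# Concatenate both modalities into one joint sequence, then quantize it.
joint = image_feats + text_feats
codes = quantize(joint, codebook)                  # unified code sequence

# Masked variant: one of the three positions is replaced by a zero vector.
masked = mask_features(joint, 0.34, [0.0, 0.0], random.Random(0))
masked_codes = quantize(masked, codebook)
```

In this sketch `codes` is a single index sequence covering both modalities, which is the kind of object a second-stage generative model could be trained on to sample image-text pairs unconditionally.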

arXiv information

Authors	Hyungyung Lee, Sungjin Park, Joonseok Lee, Edward Choi
Published	2022-10-14 13:01:42+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
