Masked Autoencoders Are Effective Tokenizers for Diffusion Models

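Abstract

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis.
However, the properties that a tokenizer's latent space needs for better learning and generation with diffusion models remain under-explored.
Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian mixture modes and more discriminative features.
Motivated by these insights, we propose MAETok, an autoencoder (AE) that leverages mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity.
Extensive experiments validate our analysis, showing that the variational form of autoencoders is not necessary and that a discriminative latent space from an AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens.
MAETok achieves significant practical improvements, reaching a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512×512 generation.
Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models.
Code and trained models are released.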

Abstract (original)

Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76x faster training and 31x higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models are released.
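As a rough illustration of the "mask modeling" idea mentioned in the abstract, the sketch below shows a generic random-masking step of the kind commonly used when training masked autoencoders: a random subset of patch tokens is hidden, and the model would then be trained to reconstruct the input from what remains. The abstract does not specify MAETok's actual masking scheme, so the function names (`random_mask`, `apply_mask`), the 50% mask ratio, and the `[MASK]` placeholder are all assumptions made for this example.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=None):
    """Split token positions into (visible, masked) index lists at random.

    Hypothetical helper for illustration; not taken from the MAETok code.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_tokens))
    rng.shuffle(order)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def apply_mask(tokens, masked_idx, mask_token):
    """Replace every masked position with a shared placeholder token."""
    masked = set(masked_idx)
    return [mask_token if i in masked else t for i, t in enumerate(tokens)]


# Assumed toy setup: 16 patch tokens, half of them masked.
patches = [f"p{i}" for i in range(16)]
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.5, seed=0)
corrupted = apply_mask(patches, masked_idx, mask_token="[MASK]")
```

In a real training loop the tokens would be feature vectors instead of strings, and the reconstruction loss would compare the decoder output against the original, unmasked input.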

arXiv information

Authors	Hao Chen, Yujin Han, Fangyi Chen, Xiang Li, Yidong Wang, Jindong Wang, Ze Wang, Zicheng Liu, Difan Zou, Bhiksha Raj
Published	2025-02-05 18:42:04+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
