Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent Diffusion

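Abstract

Text-to-image generation is a significant area of modern computer vision and has improved substantially as generative architectures have evolved.
Among these architectures, diffusion-based models have demonstrated substantial gains in quality.
These models generally fall into two categories: pixel-level and latent-level approaches.
We present Kandinsky, a novel exploration of latent diffusion architecture that combines the principles of image prior models with latent diffusion techniques.
The image prior model is trained separately to map text embeddings to CLIP image embeddings.
Another distinctive feature of the proposed model is a modified MoVQ implementation, which serves as the image autoencoder component.
Overall, the model contains 3.3B parameters.
We also deployed a user-friendly demo system that supports diverse generative modes, including text-to-image generation, image fusion, text and image fusion, image variation generation, and text-guided inpainting/outpainting.
In addition, we released the source code and checkpoints for the Kandinsky models.
Experimental evaluation shows an FID score of 8.03 on the COCO-30K dataset, making our model the top open-source performer in terms of measurable image generation quality.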

Abstract (original)

Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
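
For readers unfamiliar with the metric quoted in the abstract, the Fréchet Inception Distance (FID) is usually defined as below. This is the standard textbook definition, not a formula taken from this page; the symbols are introduced here for illustration, and the exact evaluation protocol behind the 8.03 figure is not described in the abstract.

```latex
% FID between real images (r) and generated images (g).
% \mu_r, \Sigma_r : mean and covariance of Inception-v3 features of real images
% \mu_g, \Sigma_g : mean and covariance of the same features for generated images
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values mean the two feature distributions are closer, so a lower FID is better.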

arXiv information

Authors	Anton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva, Vladimir Arkhipkin, Igor Pavlov, Ilya Ryabov, Angelina Kuts, Alexander Panchenko, Andrey Kuznetsov, Denis Dimitrov
Published	2023-10-05 12:29:41+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
