Fast Text-Conditional Discrete Denoising on Vector-Quantized Latent Spaces

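Summary

Conditional text-to-image generation has recently seen countless improvements in quality, diversity, and fidelity.
Nevertheless, most state-of-the-art models need many inference steps to produce faithful generations, which creates a performance bottleneck for end-user applications.
This paper introduces Paella, a new text-to-image model that needs fewer than 10 steps to sample high-fidelity images. Its speed-optimized architecture has 573M parameters and can sample a single image in under 500 ms.
The model operates on a compressed, quantized latent space, is conditioned on CLIP embeddings, and uses a sampling function improved over previous work.
Beyond text-conditional image generation, the model can perform latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing.
All code and pretrained models are released at https://github.com/dome272/Paella.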

Abstract (original)

Conditional text-to-image generation has seen countless recent improvements in terms of quality, diversity and fidelity. Nevertheless, most state-of-the-art models require numerous inference steps to produce faithful generations, resulting in performance bottlenecks for end-user applications. In this paper we introduce Paella, a novel text-to-image model requiring less than 10 steps to sample high-fidelity images, using a speed-optimized architecture allowing to sample a single image in less than 500 ms, while having 573M parameters. The model operates on a compressed & quantized latent space, it is conditioned on CLIP embeddings and uses an improved sampling function over previous works. Aside from text-conditional image generation, our model is able to do latent space interpolation and image manipulations such as inpainting, outpainting, and structural editing. We release all of our code and pretrained models at https://github.com/dome272/Paella
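
The abstract describes sampling in fewer than 10 steps over a quantized latent space but gives no algorithmic detail. The following is a toy sketch of what few-step iterative denoising over discrete tokens can look like in general; it is not the authors' sampling function. The names `make_oracle_denoiser` and `sample_tokens`, the linear re-noising schedule, and the oracle "model" that always predicts a fixed target are all assumptions made for illustration. In a real system the denoiser would be a trained network conditioned on a text embedding.

```python
import random


def make_oracle_denoiser(target):
    """Hypothetical stand-in for a trained, text-conditioned network.

    It ignores its input and always predicts the fixed `target` tokens,
    which keeps the example deterministic and easy to check.
    """
    def denoise(tokens):
        return list(target)
    return denoise


def sample_tokens(denoise, num_tokens, vocab_size, steps=8, seed=0):
    """Toy few-step discrete denoising loop (illustrative assumption).

    Start from uniformly random token indices, then repeatedly:
      1. ask the denoiser for a prediction of the clean tokens,
      2. re-randomize a shrinking fraction of positions.
    The final step applies no re-noising, so the output is the
    denoiser's last prediction.
    """
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(num_tokens)]
    for step in range(steps):
        predicted = denoise(tokens)
        # Assumed linear schedule: the noise ratio falls to 0.0 at the last step.
        noise_ratio = 1.0 - (step + 1) / steps
        tokens = [
            rng.randrange(vocab_size) if rng.random() < noise_ratio else p
            for p in predicted
        ]
    return tokens


if __name__ == "__main__":
    target = [3, 1, 4, 1, 5, 9, 2, 6]
    print(sample_tokens(make_oracle_denoiser(target), len(target), vocab_size=10))
```

In this sketch the token list plays the role of a flattened grid of codebook indices; decoding those indices back to pixels would require a separate decoder, which is omitted here.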

arXiv information

Authors	Dominic Rampas, Pablo Pernias, Elea Zhong, Marc Aubreville
Published	2022-11-14 11:52:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
