PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

Abstract

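This paper introduces PixArt-Σ, a Diffusion Transformer (DiT) model that can directly generate images at 4K resolution.
PixArt-Σ is a significant advance over its predecessor, PixArt-α, producing images with markedly higher fidelity and better alignment with text prompts.
A key feature of PixArt-Σ is its training efficiency.
Building on the foundational pre-training of PixArt-α, it evolves from a "weaker" baseline into a "stronger" model by incorporating higher-quality data, a process the authors call "weak-to-strong training".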
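The advances in PixArt-Σ are twofold. (1) High-quality training data: PixArt-Σ incorporates higher-quality image data paired with more precise and detailed image captions.
(2) Efficient token compression: the authors propose a new attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and making ultra-high-resolution image generation practical.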
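Thanks to these improvements, PixArt-Σ achieves superior image quality and prompt adherence with a significantly smaller model (0.6B parameters) than existing text-to-image diffusion models such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters).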
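In addition, the 4K generation capability of PixArt-Σ supports the creation of high-resolution posters and wallpapers, efficiently strengthening the production of high-quality visual content in industries such as film and gaming.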
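
The key and value compression described above can be sketched in a few lines. This is a minimal illustration, not the authors' module: the abstract only states that keys and values are compressed, so the compression operator used here (average pooling over runs of consecutive tokens), the function names `pool_tokens` and `kv_compressed_attention`, and the `ratio` parameter are all assumptions made for this example. It uses a single attention head and plain Python lists to stay self-contained.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def pool_tokens(tokens, ratio):
    """Average each run of `ratio` consecutive tokens into one token.

    Assumption: average pooling stands in for whatever compression
    operator the real model uses; the source does not specify it.
    """
    pooled = []
    for start in range(0, len(tokens), ratio):
        group = tokens[start:start + ratio]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def kv_compressed_attention(queries, keys, values, ratio):
    """Attention in which only keys and values are shortened; queries are untouched."""
    return attention(queries, pool_tokens(keys, ratio), pool_tokens(values, ratio))


if __name__ == "__main__":
    q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    for row in kv_compressed_attention(q, k, v, ratio=2):
        print(row)
```

With N query tokens and a compression ratio r, the score matrix in this sketch shrinks from N × N entries to roughly N × N/r, while the output keeps one row per query token. That reduction is the kind of saving that matters at ultra-high resolution, where N grows very large.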

Abstract (original)

In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model via incorporating higher quality data, a process we term 'weak-to-strong training'. The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence capabilities with significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.

arXiv information

Authors	Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li
Published	2024-03-07 17:41:37+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
