GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation

Summary

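This study explores Transformer-based diffusion models for image and video generation.
Although Transformer architectures dominate many fields thanks to their flexibility and scalability, visual generation still relies mainly on CNN-based U-Net architectures, particularly in diffusion-based models.
To address this gap, we introduce GenTron, a family of generative models built on Transformer-based diffusion.
Our first step was to adapt Diffusion Transformers (DiTs) from class conditioning to text conditioning, a process that involved a thorough empirical exploration of the conditioning mechanism.
We then scale GenTron from approximately 900M to over 3B parameters and observe significant improvements in visual quality.
Furthermore, we extend GenTron to text-to-video generation, incorporating a novel motion-free guidance to enhance video quality.
In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate).
GenTron also excels on T2I-CompBench, underscoring its strength in compositional generation.
We believe this work will provide meaningful insights and serve as a valuable reference for future research.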
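
The human-evaluation figures report only win and draw rates. As a minimal sketch, assuming each comparison against SDXL ends in exactly one of win, draw, or loss (the abstract does not state this explicitly), the implied loss rate can be worked out as follows. The function name `implied_loss_rate` and the `results` dictionary are hypothetical, introduced only for this illustration.

```python
def implied_loss_rate(win_pct: float, draw_pct: float) -> float:
    """Return the percentage of comparisons lost.

    Assumption (not stated in the abstract): win + draw + loss = 100%.
    The result is rounded to one decimal place to match the reported precision.
    """
    return round(100.0 - win_pct - draw_pct, 1)


# Win and draw rates for GenTron vs. SDXL, as reported in the abstract.
results = {
    "visual quality": (51.1, 19.8),
    "text alignment": (42.3, 42.9),
}

for criterion, (win, draw) in results.items():
    loss = implied_loss_rate(win, draw)
    print(f"{criterion}: win {win}%, draw {draw}%, implied loss {loss}%")
```

Under that assumption, SDXL would be preferred in 29.1% of the visual-quality comparisons and 14.8% of the text-alignment comparisons, so GenTron wins more often than it loses on both criteria.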

Abstract (original)

In this study, we explore Transformer-based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability, the visual generative domain primarily utilizes CNN-based U-Net architectures, particularly in diffusion-based models. We introduce GenTron, a family of Generative models employing Transformer-based diffusion, to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning, a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters, observing significant improvements in visual quality. Furthermore, we extend GenTron to text-to-video generation, incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL, GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate), and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron also excels in the T2I-CompBench, underscoring its strengths in compositional generation. We believe this work will provide meaningful insights and serve as a valuable reference for future research.

arXiv information

Authors	Shoufa Chen, Mengmeng Xu, Jiawei Ren, Yuren Cong, Sen He, Yanping Xie, Animesh Sinha, Ping Luo, Tao Xiang, Juan-Manuel Perez-Rua
Published	2023-12-07 18:59:30+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
