VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

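Abstract

Diffusion models show extraordinary ability in text-to-image generation, yet they may still fail to produce highly aesthetic images.
More specifically, a gap remains between generated images and real-world aesthetic images along finer-grained dimensions such as color, lighting, and composition.
In this paper, we propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter. It upgrades the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into a content description and an aesthetic description through the initialization of an aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers.
Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method while preserving image-text alignment.
Thanks to its careful design, VMix is flexible enough to be applied to community models and improves their visual performance without retraining.
To validate the effectiveness of our method, we conducted extensive experiments showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation.
The project page is https://vmix-diffusion.github.io/VMix/.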
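
The abstract names the mechanism (value-mixed cross-attention joined through zero-initialized linear layers) but does not give its formulation. The sketch below is one plausible reading, not the authors' implementation. It assumes that the aesthetic branch reuses the attention map of the ordinary text cross-attention and only supplies its own values, and that a single zero-initialized matrix stands in for the connecting linear layer. The query/key/value projections are omitted, and every name (`value_mixed_cross_attention`, `w_zero`, `v_aesthetic`, and so on) is hypothetical. What it does show reliably is the effect of zero initialization: at the start of training the adapter leaves the base model's output unchanged.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_map(q, k):
    """Scaled dot-product attention weights, shape (n_image, n_text)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)])

def cross_attention(q, k, v):
    """Ordinary cross-attention: image queries attend to text keys/values."""
    return matmul(attention_map(q, k), v)

def value_mixed_cross_attention(q, k, v_content, v_aesthetic, w_zero):
    """Assumed form: base output plus an aesthetic branch that shares the
    attention map and passes through a (zero-initialized) linear layer."""
    attn = attention_map(q, k)
    base = matmul(attn, v_content)
    branch = matmul(matmul(attn, v_aesthetic), w_zero)
    return [[b + a for b, a in zip(rb, ra)] for rb, ra in zip(base, branch)]

# Toy inputs: 3 image tokens, 2 text tokens, feature size 2 (illustrative values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v_content = [[1.0, 2.0], [3.0, 4.0]]
v_aesthetic = [[0.5, -0.5], [-1.0, 1.0]]

w_zero = [[0.0, 0.0], [0.0, 0.0]]        # zero-initialized connecting layer
w_trained = [[0.1, 0.0], [0.0, 0.1]]     # stand-in for weights after training

base = cross_attention(q, k, v_content)
mixed_at_init = value_mixed_cross_attention(q, k, v_content, v_aesthetic, w_zero)
mixed_after = value_mixed_cross_attention(q, k, v_content, v_aesthetic, w_trained)

print("identical to base at init:", mixed_at_init == base)
print("changed after training:", mixed_after != base)
```

Because the aesthetic branch contributes exactly zero until `w_zero` is updated, an adapter of this shape can be attached to a pretrained model without disturbing its behavior, which is consistent with the plug-and-play claim in the abstract.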

Abstract (original)

While diffusion models show extraordinary talents in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, there is still a gap between the generated images and the real-world aesthetic images in finer-grained dimensions including color, lighting, composition, etc. In this paper, we propose Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter, to upgrade the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into the content description and aesthetic description by the initialization of aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method, all while preserving the image-text alignment. Through our meticulous design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.

arXiv information

Authors	Shaojin Wu, Fei Ding, Mengqi Huang, Wei Liu, Qian He
Published	2024-12-30 08:47:25+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
