Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis

Summary

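Diffusion models have achieved remarkable success in generating photorealistic images, but ensuring precise semantic alignment with the input prompt remains a challenge.
Optimizing the initial noisy latent offers a more efficient way to improve semantic alignment than modifying the model architecture or engineering prompts.
A recent approach, InitNo, refines the initial noisy latent using attention maps.
However, these maps capture only limited information, and InitNo's effectiveness depends heavily on the initial starting point, because it tends to converge to a local optimum near that point.
To address this, the paper proposes using the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency.
The paper also provides a theoretical analysis of the condition under which the update improves semantic faithfulness.
Experimental results demonstrate the effectiveness and adaptability of the framework, which consistently improves semantic alignment across various diffusion models.
The code is available at https://github.com/Bomingmiao/NoiseDiffusion.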

Summary (original)

Diffusion models have achieved impressive success in generating photorealistic images, but challenges remain in ensuring precise semantic alignment with input prompts. Optimizing the initial noisy latent offers a more efficient alternative to modifying model architectures or prompt engineering for improving semantic alignment. A latest approach, InitNo, refines the initial noisy latent by leveraging attention maps; however, these maps capture only limited information, and the effectiveness of InitNo is highly dependent on the initial starting point, as it tends to converge on a local optimum near this point. To this end, this paper proposes leveraging the language comprehension capabilities of large vision-language models (LVLMs) to guide the optimization of the initial noisy latent, and introduces the Noise Diffusion process, which updates the noisy latent to generate semantically faithful images while preserving distribution consistency. Furthermore, we provide a theoretical analysis of the condition under which the update improves semantic faithfulness. Experimental results demonstrate the effectiveness and adaptability of our framework, consistently enhancing semantic alignment across various diffusion models. The code is available at https://github.com/Bomingmiao/NoiseDiffusion.
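The abstract does not spell out the update rule, so the following is only a minimal sketch of what "updating the noisy latent while preserving distribution consistency" can mean in general. It assumes a standard variance-preserving mix of the current latent with fresh Gaussian noise. The function name `variance_preserving_update` and the `step` parameter are hypothetical and are not taken from the paper or its repository. The LVLM-guided selection of updates is not shown.

```python
import math
import random


def variance_preserving_update(latent, step, rng):
    """Mix a latent with fresh Gaussian noise without changing its variance.

    Hypothetical helper, not the paper's method: if every entry of `latent`
    is N(0, 1) and independent of the fresh noise, then
    sqrt(1 - step) * latent + sqrt(step) * noise is again N(0, 1).
    """
    if not 0.0 <= step <= 1.0:
        raise ValueError("step must lie in [0, 1]")
    keep = math.sqrt(1.0 - step)
    mix = math.sqrt(step)
    return [keep * z + mix * rng.gauss(0.0, 1.0) for z in latent]


def sample_variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Draw a flat "latent" from N(0, 1) and apply one update (assumed step size).
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(20000)]
updated = variance_preserving_update(latent, step=0.3, rng=rng)
```

Under these assumptions the updated latent stays close to a standard normal sample, so a diffusion model trained on Gaussian initial noise would still see an in-distribution input. A larger `step` moves the latent further from its starting point.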

arXiv information

Authors	Boming Miao, Chunxiao Li, Xiaoxiao Wang, Andi Zhang, Rui Sun, Zizhe Wang, Yao Zhu
Published	2024-11-25 15:40:47+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
