Grounded Text-to-Image Synthesis with Attention Refocusing

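Summary

Text-to-image synthesis methods, driven by scalable diffusion models trained on large-scale paired text-image datasets, have shown compelling results.
However, these models still fail to follow the text prompt precisely when the prompt involves multiple objects, attributes, and spatial compositions.
This paper identifies potential reasons for this failure in both the cross-attention and self-attention layers of the diffusion model.
It proposes two novel losses that refocus the attention maps according to a given layout during the sampling process.
Comprehensive experiments on the DrawBench and HRS benchmarks, using layouts synthesized by large language models, show that the proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve the alignment between the generated images and the text prompts.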

Abstract (original)

Driven by scalable diffusion models trained on large-scale paired text-image datasets, text-to-image synthesis methods have shown compelling results. However, these models still fail to precisely follow the text prompt when multiple objects, attributes, and spatial compositions are involved in the prompt. In this paper, we identify the potential reasons in both the cross-attention and self-attention layers of the diffusion model. We propose two novel losses to refocus the attention maps according to a given layout during the sampling process. We perform comprehensive experiments on the DrawBench and HRS benchmarks using layouts synthesized by Large Language Models, showing that our proposed losses can be integrated easily and effectively into existing text-to-image methods and consistently improve their alignment between the generated images and the text prompts.

arXiv information

Authors	Quynh Phung, Songwei Ge, Jia-Bin Huang
Published	2023-06-08 17:59:59+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
