Control and Realism: Best of Both Worlds in Layout-to-Image without Training

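Summary

Layout-to-Image generation aims to create complex scenes while precisely controlling where subjects are placed and how they are arranged.
Existing work has shown that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data.
However, these methods often suffer from imprecise localization and unrealistic artifacts.
To address these drawbacks, the authors propose WinWinLay, a novel training-free method.
At its core, WinWinLay presents two key strategies, a Non-local Attention Energy Function and Adaptive Update, which jointly improve control precision and realism.
On the one hand, the authors show theoretically that the commonly used attention energy function introduces an inherent spatial distribution bias, which prevents objects from aligning uniformly with the layout instructions.
To overcome this, they explore a non-local attention prior that redistributes attention scores, encouraging objects to conform better to the specified spatial conditions.
On the other hand, they identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts.
As a remedy, they introduce a Langevin dynamics-based adaptive update scheme that promotes in-domain updates while respecting the layout constraints.
Extensive experiments demonstrate that WinWinLay excels at controlling element placement and achieving photorealistic visual fidelity, outperforming current state-of-the-art methods.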
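
To make the "commonly used attention energy function" concrete, here is a minimal NumPy sketch. It assumes the ratio form that appears in earlier training-free layout guidance work (one minus the fraction of attention mass inside the box, squared); the abstract does not give the exact formula, and the non-local variant proposed in the paper is not reproduced here. The names `box_mask` and `attention_energy` are hypothetical.

```python
import numpy as np

def box_mask(height, width, box):
    """Binary mask for a box (x0, y0, x1, y1) in pixel indices, end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=float)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def attention_energy(attn, mask, eps=1e-8):
    """Assumed ratio-style energy: (1 - attention mass inside box / total mass)^2."""
    inside = (attn * mask).sum()
    total = attn.sum() + eps
    return (1.0 - inside / total) ** 2

mask = box_mask(8, 8, (2, 2, 6, 6))

# Attention spread evenly over the whole box.
uniform = mask / mask.sum()

# All attention on a single pixel inside the box.
peaked = np.zeros((8, 8))
peaked[3, 3] = 1.0

# All attention on a single pixel outside the box.
outside = np.zeros((8, 8))
outside[0, 0] = 1.0

for name, attn in [("uniform", uniform), ("peaked", peaked), ("outside", outside)]:
    print(name, attention_energy(attn, mask))
```

In this toy setting the uniform and peaked maps both get an energy close to zero, so an energy of this form does not by itself reward covering the box evenly. This is only an illustration of why redistributing attention scores might help, not the paper's theoretical argument.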

Abstract (original)

Layout-to-Image generation aims to create complex scenes with precise control over the placement and arrangement of subjects. Existing works have demonstrated that pre-trained Text-to-Image diffusion models can achieve this goal without training on any specific data; however, they often face challenges with imprecise localization and unrealistic artifacts. Focusing on these drawbacks, we propose a novel training-free method, WinWinLay. At its core, WinWinLay presents two key strategies, Non-local Attention Energy Function and Adaptive Update, that collaboratively enhance control precision and realism. On one hand, we theoretically demonstrate that the commonly used attention energy function introduces inherent spatial distribution biases, hindering objects from being uniformly aligned with layout instructions. To overcome this issue, non-local attention prior is explored to redistribute attention scores, facilitating objects to better conform to the specified spatial conditions. On the other hand, we identify that the vanilla backpropagation update rule can cause deviations from the pre-trained domain, leading to out-of-distribution artifacts. We accordingly introduce a Langevin dynamics-based adaptive update scheme as a remedy that promotes in-domain updating while respecting layout constraints. Extensive experiments demonstrate that WinWinLay excels in controlling element placement and achieving photorealistic visual fidelity, outperforming the current state-of-the-art methods.
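
The contrast between a vanilla gradient update and a Langevin dynamics update can be sketched on a toy problem. The example below is the textbook unadjusted Langevin step, not the adaptive scheme from the paper, whose details the abstract does not give. The quadratic energy stands in for a layout energy, and all function names are hypothetical.

```python
import numpy as np

def grad_energy(x):
    # Toy energy E(x) = 0.5 * x**2 (an assumption for illustration); its gradient is x.
    return x

def gradient_step(x, step):
    # Plain gradient update: always moves toward the energy minimum.
    return x - step * grad_energy(x)

def langevin_step(x, step, rng):
    # Unadjusted Langevin update: gradient step plus Gaussian noise scaled by sqrt(2 * step).
    noise = rng.standard_normal(x.shape)
    return x - step * grad_energy(x) + np.sqrt(2.0 * step) * noise

def run(n_chains=5000, n_steps=2000, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x_gd = np.full(n_chains, 3.0)
    x_ld = np.full(n_chains, 3.0)
    for _ in range(n_steps):
        x_gd = gradient_step(x_gd, step)
        x_ld = langevin_step(x_ld, step, rng)
    return x_gd, x_ld

x_gd, x_ld = run()
print("gradient descent spread:", x_gd.std())
print("Langevin spread:", x_ld.std())
```

With the plain gradient step every chain collapses onto the single minimizer, while the Langevin chains stay spread out roughly according to the density proportional to exp(-E). That is the general intuition for why a noise-injected update may keep latents closer to the distribution the model was trained on while still following the energy gradient.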

arXiv information

Authors	Bonan Li, Yinhan Hu, Songhua Liu, Xinchao Wang
Published	2025-06-18 15:39:02+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
