VSC: Visual Search Compositional Text-to-Image Diffusion Model

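Abstract

Text-to-image diffusion models are remarkably good at generating realistic visuals from natural-language prompts, but they often struggle to bind attributes accurately to the corresponding objects, particularly when a prompt contains multiple attribute-object pairs. The difficulty stems mainly from the limitations of commonly used text encoders such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches try to mitigate these problems by controlling attention maps during inference, or by using layout information or fine-tuning during training, yet their performance degrades as prompts become more complex. This work introduces a new compositional generation method that uses pairwise image embeddings to improve attribute-object binding. The approach decomposes a complex prompt into sub-prompts, generates corresponding images, and computes visual prototypes that are fused with the text embeddings to strengthen the representation. Segmentation-based localization training addresses cross-attention misalignment and improves the accuracy of binding multiple attributes to objects. On the T2I CompBench benchmark, the approach outperforms existing compositional text-to-image diffusion models, achieves better image quality as judged by human evaluators, and remains robust as the number of binding pairs in the prompt grows.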

Abstract (original)

Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts containing multiple attribute-object pairs. This challenge primarily arises from the limitations of commonly used text encoders, such as CLIP, which can fail to encode complex linguistic relationships and modifiers effectively. Existing approaches have attempted to mitigate these issues through attention map control during inference and the use of layout information or fine-tuning during training, yet they face performance drops with increased prompt complexity. In this work, we introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. By applying segmentation-based localization training, we address cross-attention misalignment, achieving improved accuracy in binding multiple attributes to objects. Our approaches outperform existing compositional text-to-image diffusion models on the benchmark T2I CompBench, achieving better image quality, evaluated by humans, and emerging robustness under scaling number of binding pairs in the prompt.
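The pipeline named in the abstract (decompose the prompt into sub-prompts, generate images, compute visual prototypes, fuse them with text embeddings) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the splitting rule, the use of a mean as the prototype, the linear fusion with weight `alpha`, and all function names below are hypothetical assumptions made only for illustration. Embeddings are plain lists of floats standing in for real image and text encoder outputs.

```python
from typing import List, Sequence

Vector = List[float]


def decompose_prompt(prompt: str) -> List[str]:
    """Split a prompt into sub-prompts on commas and ' and '.

    Assumption: a naive stand-in for whatever parsing the paper uses.
    """
    parts = prompt.replace(",", " and ").split(" and ")
    return [p.strip() for p in parts if p.strip()]


def visual_prototype(image_embeddings: Sequence[Vector]) -> Vector:
    """Average the embeddings of the images generated for one sub-prompt.

    Assumption: the prototype is a simple element-wise mean.
    """
    if not image_embeddings:
        raise ValueError("need at least one image embedding")
    dim = len(image_embeddings[0])
    count = len(image_embeddings)
    return [sum(e[i] for e in image_embeddings) / count for i in range(dim)]


def fuse(text_embedding: Vector, prototype: Vector, alpha: float = 0.5) -> Vector:
    """Blend a text embedding with a visual prototype.

    Assumption: linear interpolation; alpha is a hypothetical mixing weight.
    """
    if len(text_embedding) != len(prototype):
        raise ValueError("embedding dimensions must match")
    return [(1 - alpha) * t + alpha * p for t, p in zip(text_embedding, prototype)]


if __name__ == "__main__":
    sub_prompts = decompose_prompt("a red apple and a blue car")
    print(sub_prompts)
    # Two made-up 2-D "image embeddings" for the first sub-prompt.
    proto = visual_prototype([[1.0, 0.0], [0.0, 1.0]])
    print(fuse([1.0, 1.0], proto, alpha=0.5))
```

In a real system the lists would be replaced by encoder outputs, and the fused embedding would condition the diffusion model; the segmentation-based localization training mentioned in the abstract is not covered by this sketch.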

arXiv information

Authors	Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh
Published	2025-05-02 08:31:43+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
