Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation

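Summary

Text-to-image (T2I) generative diffusion models have demonstrated excellent performance in synthesizing diverse, high-quality visuals from text captions.
Several layout-to-image models have been developed to control the generation process using a wide range of layouts, such as segmentation maps, edges, and human keypoints.
In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information so that desired objects can be rendered and placed precisely at locations defined by bounding boxes.
To achieve this, we substantially modify the network architecture introduced in ControlNet and integrate it with the grounding method proposed in GLIGEN.
We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset.
Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, and outperforms the current SOTA model trained on open-source datasets across all three metrics.
ObjectDiffusion shows a distinctive ability to synthesize diverse, high-quality, high-fidelity images that conform seamlessly to the semantic and spatial control layout.
In qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts.
The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects of varying sizes, forms, and locations.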

Abstract (original)

Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
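The AP$_{\text{50}}$ figure is easier to read with the underlying box-overlap test in mind. The following is a minimal sketch, assuming the metric follows the standard COCO detection convention in which a detected box counts as a match when its intersection over union (IoU) with the reference box is at least 0.5. The abstract does not describe the evaluation pipeline, so the function names `iou` and `is_match_at_50` and the `(x_min, y_min, x_max, y_max)` box format are illustrative assumptions, not the authors' code.

```python
from typing import Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(requested: Box, detected: Box) -> bool:
    """Hypothetical helper: does the detected box meet the IoU 0.5 threshold?"""
    return iou(requested, detected) >= 0.5
```

For example, a requested box `(0, 0, 10, 10)` and a detected box `(2, 0, 12, 10)` overlap with IoU 80/120, so they match at the 0.5 threshold, whereas a detected box shifted to `(5, 0, 15, 10)` has IoU 50/150 and does not.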

arXiv information

Authors	Ahmad Süleyman, Göksel Biricik
Published	2025-02-10 18:54:23+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
