Guiding Text-to-Image Diffusion Model Towards Grounded Generation

Summary

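This paper aims to augment a pre-trained text-to-image diffusion model with open-vocabulary object grounding, that is, to generate an image together with segmentation masks for the visual entities described in the text prompt.
The contributions are as follows. (i) A grounding module is inserted into an existing diffusion model; it can be trained to align the model's visual and textual embedding spaces using only a small number of object categories.
(ii) An automatic pipeline is proposed for constructing a dataset of {image, segmentation mask, text prompt} triplets on which to train the grounding module.
(iii) Open-vocabulary grounding is evaluated on images generated by the text-to-image diffusion model, showing that the module segments objects well even for categories not seen during training.
(iv) The guided diffusion model is used to build a synthetic semantic segmentation dataset; a standard segmentation model trained on it achieves competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for applying powerful diffusion models to discriminative tasks.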
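
The idea of grounding a prompt in generated pixels and then scoring the resulting mask can be illustrated with a toy sketch. This is not the authors' module: the thresholded dot-product rule, the function names `ground` and `iou`, and the use of intersection-over-union as the score are all assumptions made for illustration, since the summary does not specify the architecture or the metric.

```python
from math import exp
from typing import List

Mask = List[List[int]]                 # H x W, 1 = object, 0 = background
Features = List[List[List[float]]]     # H x W x C per-pixel visual features


def ground(features: Features, text_embedding: List[float],
           threshold: float = 0.5) -> Mask:
    """Hypothetical grounding rule (an assumption, not the paper's module):
    take the dot product of each pixel feature with the text embedding,
    squash it with a sigmoid, and threshold the result into a binary mask."""
    mask: Mask = []
    for row in features:
        out_row = []
        for pixel in row:
            score = sum(p * t for p, t in zip(pixel, text_embedding))
            prob = 1.0 / (1.0 + exp(-score))
            out_row.append(1 if prob > threshold else 0)
        mask.append(out_row)
    return mask


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of equal shape.
    Two empty masks are treated as a perfect match."""
    inter = sum(p & t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    union = sum(p | t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    return inter / union if union else 1.0


# Toy 2 x 2 "image" with 2-channel features and a 2-d text embedding.
features = [[[2.0, 0.0], [-2.0, 0.0]],
            [[1.0, 1.0], [0.0, -3.0]]]
predicted = ground(features, [1.0, 1.0])
reference = [[1, 0], [0, 0]]
score = iou(predicted, reference)
```

In this toy case the left column has positive similarity to the text embedding, so the predicted mask marks those two pixels, and half of the marked-or-reference area overlaps with the reference mask.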

Abstract (original)

The goal of this paper is to augment a pre-trained text-to-image diffusion model with the ability of open-vocabulary objects grounding, i.e., simultaneously generating images and segmentation masks for the corresponding visual entities described in the text prompt. We make the following contributions: (i) we insert a grounding module into the existing diffusion model, that can be trained to align the visual and textual embedding space of the diffusion model with only a small number of object categories; (ii) we propose an automatic pipeline for constructing a dataset, that consists of {image, segmentation mask, text prompt} triplets, to train the proposed grounding module; (iii) we evaluate the performance of open-vocabulary grounding on images generated from the text-to-image diffusion model and show that the module can well segment the objects of categories beyond seen ones at training time; (iv) we adopt the guided diffusion model to build a synthetic semantic segmentation dataset, and show that training a standard segmentation model on such dataset demonstrates competitive performance on zero-shot segmentation(ZS3) benchmark, which opens up new opportunities for adopting the powerful diffusion model for discriminative tasks.

arXiv information

Authors	Ziyi Li,Qinye Zhou,Xiaoyun Zhang,Ya Zhang,Yanfeng Wang,Weidi Xie
Published	2023-01-12 18:59:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
