ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

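Summary

Image synthesis has advanced considerably in the last few years.
Beyond describing the desired output image with a text prompt, an intuitive approach is to add spatial guidance in the form of an image, such as a depth map.
A recent and very popular way to do this is to pair a control network, such as ControlNet, with a pre-trained image generation model, such as Stable Diffusion.
Evaluating the design of existing control networks shows that they all share one problem: information flowing between the generation process and the control process is delayed.
As a consequence, the control network must itself have generative capabilities.
This work proposes a new control architecture, ControlNet-XS, which does not have this problem and can therefore concentrate on its actual task of learning to control.
Compared with ControlNet, the model needs only a fraction of the parameters, which makes it about twice as fast at both inference and training.
In addition, the generated images are of higher quality and the control is more faithful.
All code and pre-trained models will be made publicly available.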

Abstract (original)

The field of image synthesis has made tremendous strides forward in the last years. Besides defining the desired output image with text-prompts, an intuitive approach is to additionally use spatial guidance in form of an image, such as a depth map. For this, a recent and highly popular approach is to use a controlling network, such as ControlNet, in combination with a pre-trained image generation model, such as Stable Diffusion. When evaluating the design of existing controlling networks, we observe that they all suffer from the same problem of a delay in information flowing between the generation and controlling process. This, in turn, means that the controlling network must have generative capabilities. In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem, and hence can focus on the given task of learning to control. In contrast to ControlNet, our model needs only a fraction of parameters, and hence is about twice as fast during inference and training time. Furthermore, the generated images are of higher quality and the control is of higher fidelity. All code and pre-trained models will be made publicly available.

arXiv information

Authors	Denis Zavadski, Johann-Friedrich Feiden, Carsten Rother
Published	2023-12-11 17:58:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
