Golden Noise for Diffusion Models: A Learning Framework

Summary

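Text-to-image diffusion models are a popular paradigm for synthesizing personalized images from a text prompt and a random Gaussian noise.
Some noises have been observed to be “golden noises” that achieve better text-image alignment and higher human preference than others. Even so, there is still no machine learning framework for obtaining these golden noises.
To learn golden noises for diffusion sampling, this paper makes three main contributions.
First, we identify a new concept called the noise prompt. It aims to turn a random Gaussian noise into a golden noise by adding a small, desirable perturbation derived from the text prompt.
Building on this concept, we formulate the noise prompt learning framework, which systematically learns “prompted” golden noises associated with the text prompts of diffusion models.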
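The sketch below illustrates the noise prompt idea in its simplest form. It assumes the perturbation is a fixed-length step along a direction that depends on the prompt. The helper names `prompt_direction` and `apply_noise_prompt` and the `scale` value are hypothetical. Hashing the prompt into a random direction is only a toy stand-in for the learned, text-conditioned perturbation described in the paper; it is not NPNet.

```python
import hashlib
import random


def prompt_direction(prompt: str, dim: int) -> list[float]:
    # Hypothetical stand-in for a text encoder: seed a PRNG from a hash of the
    # prompt so each prompt maps to a fixed unit-norm direction.
    seed = int.from_bytes(hashlib.sha256(prompt.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def apply_noise_prompt(noise: list[float], prompt: str, scale: float = 0.1) -> list[float]:
    # "Golden" noise = random noise + a small prompt-derived perturbation.
    # The fixed `scale` is an assumption; the paper's perturbation is learned.
    delta = prompt_direction(prompt, len(noise))
    return [n + scale * d for n, d in zip(noise, delta)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(16)]  # toy 16-dim "latent"
golden = apply_noise_prompt(noise, "a red fox in the snow", scale=0.1)
```

The direction has unit norm, so the perturbation's length equals `scale`. This keeps the golden noise close to the original random noise, and the same prompt always yields the same perturbation.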
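Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD). It contains 100,000 pairs of random noises and golden noises together with their associated text prompts.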
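The abstract does not say how golden noises are produced or verified, so the following shows only one plausible shape for such a pipeline: a candidate-then-filter design. `refine` proposes a golden noise and `score` judges a noise for a prompt. Both are hypothetical placeholders for real diffusion sampling and a quality or preference model, and every name here is illustrative.

```python
import random
from typing import Callable, NamedTuple

Vector = list[float]


class NoisePromptPair(NamedTuple):
    prompt: str
    random_noise: Vector
    golden_noise: Vector


def collect_pairs(
    prompts: list[str],
    dim: int,
    refine: Callable[[Vector, str], Vector],
    score: Callable[[Vector, str], float],
    seed: int = 0,
) -> list[NoisePromptPair]:
    # For each prompt: sample random noise, propose a candidate golden noise,
    # and keep the pair only if the candidate scores strictly higher.
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        candidate = refine(noise, prompt)
        if score(candidate, prompt) > score(noise, prompt):
            dataset.append(NoisePromptPair(prompt, noise, candidate))
    return dataset


# Toy placeholders: `toy_refine` nudges the first coordinate up and `toy_score`
# reads it back, so every candidate passes the filter.
def toy_refine(noise: Vector, prompt: str) -> Vector:
    return [noise[0] + 1.0] + noise[1:]


def toy_score(noise: Vector, prompt: str) -> float:
    return noise[0]


pairs = collect_pairs(["a cat", "a dog", "a bird"], dim=8, refine=toy_refine, score=toy_score)
```

Keeping only the pairs where the candidate wins means every stored pair records a "better" noise for its prompt. That is the kind of supervision a noise-to-noise model needs.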
Using NPD as the training set, we trained a small noise prompt network (NPNet) that learns to transform a random noise directly into a golden noise.
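To make "learn to transform a random noise into a golden noise" concrete, here is a toy supervised-regression sketch that assumes paired data like NPD. It fits a per-dimension affine residual, golden - noise ≈ a*noise + c*embed + b, by full-batch gradient descent on mean squared error. The function names, the linear model, and the prompt embedding `embed` are hypothetical simplifications. NPNet itself is a neural network whose architecture the abstract does not describe.

```python
import random

Vector = list[float]


def train_residual_model(noises, embeds, goldens, lr=0.1, steps=400):
    # Hypothetical toy model, fit per dimension j:
    #   golden[j] - noise[j] ~= a[j]*noise[j] + c[j]*embed[j] + b[j]
    # using full-batch gradient descent on mean squared error.
    n, dim = len(noises), len(noises[0])
    a, c, b = [0.0] * dim, [0.0] * dim, [0.0] * dim
    for _ in range(steps):
        ga, gc, gb = [0.0] * dim, [0.0] * dim, [0.0] * dim
        for x, e, g in zip(noises, embeds, goldens):
            for j in range(dim):
                err = a[j] * x[j] + c[j] * e[j] + b[j] - (g[j] - x[j])
                ga[j] += 2.0 * err * x[j] / n
                gc[j] += 2.0 * err * e[j] / n
                gb[j] += 2.0 * err / n
        for j in range(dim):
            a[j] -= lr * ga[j]
            c[j] -= lr * gc[j]
            b[j] -= lr * gb[j]
    return a, c, b


def predict_golden(x: Vector, e: Vector, params) -> Vector:
    # Golden noise = input noise + predicted residual (the "noise prompt").
    a, c, b = params
    return [xj + aj * xj + cj * ej + bj for xj, ej, aj, cj, bj in zip(x, e, a, c, b)]


rng = random.Random(0)
noises = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
embeds = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]
# Synthetic "golden" targets with a known residual: 0.05*x + 0.1*e + 0.2.
goldens = [[xj + 0.05 * xj + 0.1 * ej + 0.2 for xj, ej in zip(x, e)] for x, e in zip(noises, embeds)]
params = train_residual_model(noises, embeds, goldens)
```

The model predicts the residual instead of the golden noise itself. This mirrors the framing of a golden noise as a random noise plus a small learned perturbation.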
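The learned golden-noise perturbation can be regarded as a kind of prompt for the noise, because it is rich in semantic information and tailored to the given text prompt.
Third, our extensive experiments show that NPNet is effective and generalizes well in improving the quality of synthesized images across several diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT.
Moreover, NPNet is a small, efficient controller that works as a plug-and-play module with very little additional inference and computational cost. It only supplies a golden noise in place of a random noise and needs no access to the original pipeline.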
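The plug-and-play property can be pictured as a thin wrapper around an unchanged sampler: only the initial noise passes through the noise prompt network, and nothing inside the sampler is modified. `with_golden_noise`, `fake_sampler`, and `fake_npnet` are hypothetical names. In Hugging Face diffusers, pipelines such as `StableDiffusionXLPipeline` accept a precomputed `latents` argument, which is one natural place to supply such a golden noise. The sketch itself does not call diffusers.

```python
from typing import Any, Callable

Vector = list[float]
Sampler = Callable[[str, Vector], Any]


def with_golden_noise(sample: Sampler, noise_prompt_net: Callable[[Vector, str], Vector]) -> Sampler:
    # Return a sampler that swaps random noise for golden noise, then defers
    # to the original sampler without touching its internals.
    def wrapped(prompt: str, noise: Vector) -> Any:
        return sample(prompt, noise_prompt_net(noise, prompt))
    return wrapped


calls: list[Vector] = []


def fake_sampler(prompt: str, latents: Vector) -> str:
    # Stand-in for a real diffusion pipeline; records the noise it receives.
    calls.append(latents)
    return f"image for {prompt!r}"


def fake_npnet(noise: Vector, prompt: str) -> Vector:
    # Hypothetical noise prompt network: adds a small constant perturbation.
    return [n + 0.1 for n in noise]


sampler = with_golden_noise(fake_sampler, fake_npnet)
result = sampler("a lighthouse at dusk", [0.0, 1.0, -1.0])
```

In this picture, the only extra work per generation is a single call to the noise network before sampling starts.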

Summary (original)

Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are “golden noises” that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those golden noises. To learn golden noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt, which aims at turning a random Gaussian noise into a golden noise by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the noise prompt learning framework that systematically learns “prompted” golden noise associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset (NPD) that contains 100k pairs of random noises and golden noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network (NPNet) that can directly learn to transform a random noise into a golden noise. The learned golden noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a golden noise instead of a random noise without accessing the original pipeline.

arXiv information

Authors	Zikai Zhou, Shitong Shao, Lichen Bai, Zhiqiang Xu, Bo Han, Zeke Xie
Published	2024-11-14 15:13:13+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
