EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models

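Summary

Text-to-image generation models (e.g., Stable Diffusion) have advanced significantly and can now create high-quality, realistic images from textual descriptions. Prompt inversion, the task of identifying the text prompt that was used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme that optimizes for prompts representative of the vocabulary space, but challenges in semantic fluency and efficiency remain. Advanced image captioning models and visual large language models can generate highly interpretable prompts, but the images those prompts produce often lack similarity to the target image. This paper proposes EDITOR, a prompt inversion technique for text-to-image diffusion models that initializes embeddings with a pre-trained image captioning model, refines them through reverse-engineering in the latent space, and converts them to text with an embedding-to-text model. Experiments on widely used datasets such as MS COCO, LAION, and Flickr show that the method outperforms existing methods in image similarity, textual alignment, prompt interpretability, and generalizability. The authors further illustrate applications of the generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation, and unsupervised segmentation.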

Abstract (original)

Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on widely used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
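
The three stages named in the abstract (caption-based initialization, refinement against the target image, conversion back to text) can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions and not the authors' implementation: the two-dimensional vocabulary, the fixed linear map `W` standing in for a frozen diffusion model, the hard-coded "caption", and every function name are hypothetical. The final step uses a plain nearest-vocabulary lookup, which is a swapped-in stand-in for the learned embedding-to-text model the abstract mentions.

```python
# Toy sketch of a three-stage prompt-inversion loop (all names hypothetical).

# Hypothetical 2-D token embeddings; a real text encoder is far larger.
VOCAB = {
    "a": (0.0, 0.0), "cat": (1.0, 0.0), "dog": (0.0, 1.0),
    "on": (1.0, 1.0), "sofa": (2.0, 0.0), "grass": (0.0, 2.0),
}
# Frozen toy "generator" weights (assumption: stands in for a diffusion model).
W = ((1.0, 0.5), (0.0, 1.0))


def embed(words):
    """Look up a mutable embedding vector for each word."""
    return [list(VOCAB[w]) for w in words]


def generate(embeds):
    """Map each token embedding through W to get toy 'image' features."""
    return [[W[0][0] * e[0] + W[0][1] * e[1],
             W[1][0] * e[0] + W[1][1] * e[1]] for e in embeds]


def loss(embeds, target):
    """Sum of squared differences between generated and target features."""
    out = generate(embeds)
    return sum((o - t) ** 2
               for oe, te in zip(out, target) for o, t in zip(oe, te))


def refine(embeds, target, lr=0.1, steps=200):
    """Stage 2: gradient descent on the embeddings; the generator stays fixed."""
    embeds = [e[:] for e in embeds]
    for _ in range(steps):
        out = generate(embeds)
        for e, o, t in zip(embeds, out, target):
            r0, r1 = o[0] - t[0], o[1] - t[1]
            # Gradient of the squared error w.r.t. e is 2 * W^T * residual.
            e[0] -= lr * 2 * (W[0][0] * r0 + W[1][0] * r1)
            e[1] -= lr * 2 * (W[0][1] * r0 + W[1][1] * r1)
    return embeds


def decode(embeds):
    """Stage 3 stand-in: nearest-vocabulary lookup instead of a learned model."""
    def nearest(e):
        return min(VOCAB, key=lambda w: (VOCAB[w][0] - e[0]) ** 2
                                        + (VOCAB[w][1] - e[1]) ** 2)
    return [nearest(e) for e in embeds]


true_prompt = ["a", "cat", "on", "sofa"]        # unknown to the inverter
target_image = generate(embed(true_prompt))     # the observed artifact
caption = ["a", "dog", "on", "grass"]           # Stage 1: hypothetical captioner output
init = embed(caption)
refined = refine(init, target_image)
inverted = decode(refined)
print(inverted)
```

In this toy setting the caption is readable but describes the wrong content, and the refinement step pulls its embeddings toward ones that reproduce the target features before they are decoded back to words. A real system would replace each stand-in with a trained model, and the optimization would run through a diffusion model's latent space rather than a 2×2 matrix.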

arXiv information

Authors	Mingzhe Li, Gehao Zhang, Zhenting Wang, Shiqing Ma, Siqi Pan, Richard Cartwright, Juan Zhai
Published	2025-06-03 16:44:15+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
