Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models

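Abstract

Text-to-image models such as DALL-E and Stable Diffusion have revolutionized visual content creation across a range of applications, including advertising, personalized media, and design prototyping.
However, writing effective text prompts to guide these models remains difficult and often requires extensive trial and error.
Existing prompt inversion approaches, such as soft and hard prompt techniques, are of limited effectiveness because their prompts are hard to interpret and often incoherent.
To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that uses large language models (LLMs) and CLIP-based guidance to generate coherent, semantically aligned prompts.
In essence, VGD uses the robust text generation capabilities of LLMs to produce human-readable prompts.
In addition, by using CLIP scores to ensure alignment with user-specified visual concepts, VGD improves the interpretability, generalization, and flexibility of prompt generation without any additional training.
Our experiments show that VGD outperforms existing prompt inversion techniques at generating understandable, contextually relevant prompts, enabling more intuitive and controllable interaction with text-to-image models.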

Abstract (original)

Text-to-image generative models like DALL-E and Stable Diffusion have revolutionized visual content creation across various applications, including advertising, personalized media, and design prototyping. However, crafting effective textual prompts to guide these models remains challenging, often requiring extensive trial and error. Existing prompt inversion approaches, such as soft and hard prompt techniques, are not so effective due to the limited interpretability and incoherent prompt generation. To address these issues, we propose Visually Guided Decoding (VGD), a gradient-free approach that leverages large language models (LLMs) and CLIP-based guidance to generate coherent and semantically aligned prompts. In essence, VGD utilizes the robust text generation capabilities of LLMs to produce human-readable prompts. Further, by employing CLIP scores to ensure alignment with user-specified visual concepts, VGD enhances the interpretability, generalization, and flexibility of prompt generation without the need for additional training. Our experiments demonstrate that VGD outperforms existing prompt inversion techniques in generating understandable and contextually relevant prompts, facilitating more intuitive and controllable interactions with text-to-image models.
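The abstract describes the idea only at a high level: candidate prompts are scored both by how fluent a language model finds them and by how well they match the target image under CLIP, with no gradient computation. The sketch below is a toy illustration of that kind of guided beam search, not the authors' implementation. Everything in it is an assumption made for illustration: the bigram table TOY_LM stands in for an LLM, image_text_score stands in for a CLIP similarity, and the weight alpha and the additive combination of the two scores are hypothetical choices.

```python
import math

# Hypothetical stand-in for an LLM: bigram next-token probabilities.
TOY_LM = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"photo": 0.5, "cat": 0.3, "dog": 0.2},
    "the": {"cat": 0.6, "dog": 0.4},
    "photo": {"of": 1.0},
    "of": {"the": 1.0},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def lm_next_logprobs(tokens):
    """Log-probabilities of each possible next token given the last token."""
    return {tok: math.log(p) for tok, p in TOY_LM[tokens[-1]].items()}


def image_text_score(tokens, image_concepts):
    """Hypothetical stand-in for a CLIP score: the fraction of the image's
    concepts that are mentioned in the candidate prompt."""
    return len(set(tokens) & image_concepts) / len(image_concepts)


def guided_decode(image_concepts, alpha=2.0, beam_width=2, max_len=8):
    """Beam search ranking candidates by LM log-prob + alpha * image score.

    No gradients are computed: the image only influences which candidate
    tokens survive each pruning step.
    """
    beams = [(["<s>"], 0.0)]  # (tokens so far, cumulative LM log-prob)
    finished = []             # (combined score, tokens) of completed prompts
    for _ in range(max_len):
        candidates = []
        for tokens, lm_lp in beams:
            for tok, lp in lm_next_logprobs(tokens).items():
                new_tokens = tokens + [tok]
                new_lp = lm_lp + lp
                score = new_lp + alpha * image_text_score(new_tokens, image_concepts)
                candidates.append((score, new_tokens, new_lp))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, toks, lp in candidates[:beam_width]:
            if toks[-1] == "</s>":
                finished.append((score, toks))
            else:
                beams.append((toks, lp))
        if not beams:
            break
    if not finished:
        # Fall back to unfinished beams if nothing reached the end token.
        finished = [
            (lp + alpha * image_text_score(toks, image_concepts), toks)
            for toks, lp in beams
        ]
    best_tokens = max(finished, key=lambda f: f[0])[1]
    return " ".join(t for t in best_tokens if t not in ("<s>", "</s>"))


if __name__ == "__main__":
    print(guided_decode({"dog"}))             # image guidance on
    print(guided_decode({"dog"}, alpha=0.0))  # language model only
```

Setting alpha to 0 reduces the search to plain language-model decoding, which makes it easy to see how much the image term changes the chosen prompt. In a real system the two stand-ins would be replaced by an actual LLM and an actual CLIP model; how the paper weights and combines them is not stated in the abstract.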

arXiv information

Authors	Donghoon Kim,Minji Bae,Kyuhong Shim,Byonghyo Shim
Published	2025-05-13 14:40:22+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
