Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation

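Summary

Subject-driven text-to-image diffusion models let users adapt a model to new concepts that are absent from the pre-training dataset, using only a few sample images.
However, prevailing subject-driven models rely mainly on single-concept input images and struggle to specify the target concept when an input image contains multiple concepts.
To address this, we introduce a textually localized text-to-image model (Textual Localization) that handles multi-concept input images.
During fine-tuning, our method applies a novel cross-attention guidance that decomposes multiple concepts, establishing a distinct connection between the visual representation of the target concept and the identifier token in the text prompt.
Experimental results show that, on multi-concept input images, our method outperforms or performs comparably to the baseline models in image fidelity and image-text alignment.
Compared with Custom Diffusion, our method with hard guidance achieves CLIP-I scores that are 7.04% and 8.13% higher and CLIP-T scores that are 2.22% and 5.85% higher in single-concept and multi-concept generation, respectively.
Notably, our method produces cross-attention maps that are consistent with the target concept in the generated images, a capability that existing models lack.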

Abstract (original)

Subject-driven text-to-image diffusion models empower users to tailor the model to new concepts absent in the pre-training dataset using a few sample images. However, prevalent subject-driven models primarily rely on single-concept input images, facing challenges in specifying the target concept when dealing with multi-concept input images. To this end, we introduce a textual localized text-to-image model (Textual Localization) to handle multi-concept input images. During fine-tuning, our method incorporates a novel cross-attention guidance to decompose multiple concepts, establishing distinct connections between the visual representation of the target concept and the identifier token in the text prompt. Experimental results reveal that our method outperforms or performs comparably to the baseline models in terms of image fidelity and image-text alignment on multi-concept input images. In comparison to Custom Diffusion, our method with hard guidance achieves CLIP-I scores that are 7.04%, 8.13% higher and CLIP-T scores that are 2.22%, 5.85% higher in single-concept and multi-concept generation, respectively. Notably, our method generates cross-attention maps consistent with the target concept in the generated images, a capability absent in existing models.
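The abstract reports results as CLIP-I (image fidelity) and CLIP-T (image-text alignment) scores but does not spell out how they are computed. In common usage, both are average cosine similarities between CLIP embeddings. The following is a minimal sketch under that assumption, not the authors' evaluation code. It assumes the embeddings have already been produced by a CLIP image or text encoder, and the function names `cosine_similarity`, `clip_i`, and `clip_t` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def clip_i(generated_embs, reference_embs):
    """Assumed CLIP-I: mean pairwise cosine similarity between the image
    embeddings of generated images and those of the reference images."""
    sims = [
        cosine_similarity(g, r)
        for g in generated_embs
        for r in reference_embs
    ]
    return float(np.mean(sims))


def clip_t(generated_embs, prompt_emb):
    """Assumed CLIP-T: mean cosine similarity between the image embeddings
    of generated images and the text embedding of the prompt."""
    sims = [cosine_similarity(g, prompt_emb) for g in generated_embs]
    return float(np.mean(sims))
```

In practice the inputs would be high-dimensional vectors from a CLIP model. Higher CLIP-I indicates generated images that look more like the reference subject, and higher CLIP-T indicates images that match the prompt more closely.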

arXiv information

Authors	Junjie Shentu, Matthew Watson, Noura Al Moubayed
Published	2024-02-15 14:19:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
