Training-Free Consistent Text-to-Image Generation

Abstract

Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
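
The abstract names a subject-driven shared attention block but gives no implementation details. The following is only an illustrative sketch of the general idea, not the authors' method. It assumes that each image in a batch has per-patch queries, keys and values, and that a boolean mask marking subject patches is already available. The function name `shared_subject_attention`, the tensor shapes, and the way the mask is obtained are all hypothetical. Each image attends to all of its own patches plus only the subject patches of the other images in the batch.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def shared_subject_attention(q, k, v, subject_mask):
    """Hypothetical sketch of attention shared across a batch of images.

    q, k, v:       arrays of shape (B, N, D) for B images, N patches, D channels.
    subject_mask:  boolean array of shape (B, N); True marks a patch that is
                   assumed to belong to the subject (how the mask is computed
                   is outside the scope of this sketch).
    """
    B, N, D = q.shape
    # Pool keys and values of every image into one sequence of length B*N.
    k_all = k.reshape(B * N, D)
    v_all = v.reshape(B * N, D)
    out = np.empty_like(v)
    for i in range(B):
        # Image i may look at every patch of itself ...
        allowed = subject_mask.copy()
        allowed[i, :] = True
        # ... but only at the subject patches of the other images.
        allowed = allowed.reshape(B * N)
        scores = q[i] @ k_all.T / np.sqrt(D)            # (N, B*N)
        scores = np.where(allowed[None, :], scores, -np.inf)
        out[i] = softmax(scores) @ v_all                # (N, D)
    return out
```

With an all-False mask this reduces to ordinary per-image self-attention, which makes the effect of the sharing easy to isolate in a small experiment.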

arXiv information

Authors	Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, Yuval Atzmon
Published	2024-02-05 18:42:34+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
