ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

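Summary

Diffusion models have shown strong capabilities in generating a wide range of visual content, but their ability to render text remains limited. They often produce inaccurate characters or words that do not blend well with the underlying image.
To address these shortcomings, we introduce a new framework named ARTIST, which incorporates a dedicated textual diffusion model focused specifically on learning text structures.
First, we pretrain this textual model to capture the intricacies of text representation.
Next, we finetune a visual diffusion model so that it can assimilate text structure information from the pretrained textual model.
This disentangled architecture design and training strategy significantly improves the text rendering ability of diffusion models for text-rich image generation.
In addition, we leverage pretrained large language models to better interpret user intent, which contributes to improved generation quality.
Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing improvements of up to 15% across various metrics.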

Summary (original)

Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework named ARTIST, which incorporates a dedicated textual diffusion model to focus on the learning of text structures specifically. Initially, we pretrain this textual model to capture the intricacies of text representation. Subsequently, we finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of the diffusion models for text-rich image generation. Additionally, we leverage the capabilities of pretrained large language models to interpret user intentions better, contributing to improved generation quality. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
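
The two-stage schedule described in the abstract (pretrain one module, then train a second module on top of it) can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions: the scalar "models", the `fit_scalar` helper, the toy targets, and the choice to keep the first module fully frozen in stage 2 are all hypothetical and are not taken from the ARTIST paper or its code.

```python
# Toy sketch of a two-stage training schedule (hypothetical, not ARTIST code).
# Each "module" is a single scalar weight so the example runs without dependencies.

def fit_scalar(weight, samples, lr=0.05, epochs=200):
    """Fit y ~ weight * x by gradient descent on squared error."""
    for _ in range(epochs):
        for x, y in samples:
            error = weight * x - y
            weight -= lr * 2.0 * error * x  # d/dw (w*x - y)^2 = 2 * error * x
    return weight

inputs = [0.5, 1.0, 1.5]

# Stage 1: pretrain the "text" module on its own (toy target: y = 2x).
text_weight = fit_scalar(0.0, [(x, 2.0 * x) for x in inputs])

# Stage 2: keep the text module fixed (an assumption of this sketch) and
# train only the "visual" weight on the features it produces (toy target: y = 6x).
frozen_text_weight = text_weight
features = [frozen_text_weight * x for x in inputs]
visual_weight = fit_scalar(0.0, [(f, 6.0 * x) for f, x in zip(features, inputs)])

print(round(text_weight, 3), round(visual_weight, 3))
```

The point of the split is that the second stage only has to learn how to use the first module's output, rather than relearning text structure from scratch.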

arXiv information

Authors	Jianyi Zhang,Yufan Zhou,Jiuxiang Gu,Curtis Wigington,Tong Yu,Yiran Chen,Tong Sun,Ruiyi Zhang
Published	2024-12-02 10:17:48+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
