Glyph-ByT5-v2: A Strong Aesthetic Baseline for Accurate Multilingual Visual Text Rendering

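Summary

Glyph-ByT5 recently achieved highly accurate visual text rendering in graphic design images.
However, it still focuses solely on English, and its output is comparatively weak in visual appeal.
This work addresses both limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which support accurate visual text rendering in 10 different languages and also achieve much better aesthetic quality.
To get there, the authors make three contributions: (i) a high-quality multilingual glyph-text and graphic design dataset, consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages;
(ii) a multilingual visual paragraph benchmark of 1,000 prompts, 100 per language, for assessing multilingual visual spelling accuracy; and (iii) use of the latest step-aware preference learning approach to improve visual aesthetic quality.
Combining these techniques yields a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that support accurate spelling in 10 different languages.
Given that the latest DALL-E3 and Ideogram 1.0 still struggle with multilingual visual text rendering, the authors regard this work as a significant advance.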

Abstract (original)

Recently, Glyph-ByT5 has achieved highly accurate visual text rendering performance in graphic design images. However, it still focuses solely on English and performs relatively poorly in terms of visual appeal. In this work, we address these two fundamental limitations by presenting Glyph-ByT5-v2 and Glyph-SDXL-v2, which not only support accurate visual text rendering for 10 different languages but also achieve much better aesthetic quality. To achieve this, we make the following contributions: (i) creating a high-quality multilingual glyph-text and graphic design dataset consisting of more than 1 million glyph-text pairs and 10 million graphic design image-text pairs covering nine other languages, (ii) building a multilingual visual paragraph benchmark consisting of 1,000 prompts, with 100 for each language, to assess multilingual visual spelling accuracy, and (iii) leveraging the latest step-aware preference learning approach to enhance the visual aesthetic quality. With the combination of these techniques, we deliver a powerful customized multilingual text encoder, Glyph-ByT5-v2, and a strong aesthetic graphic generation model, Glyph-SDXL-v2, that can support accurate spelling in 10 different languages. We perceive our work as a significant advancement, considering that the latest DALL-E3 and Ideogram 1.0 still struggle with the multilingual visual text rendering task.

arXiv information

Authors	Zeyu Liu, Weicong Liang, Yiming Zhao, Bohan Chen, Lin Liang, Lijuan Wang, Ji Li, Yuhui Yuan
Published	2024-07-12 16:26:40+00:00

Source and services used

arxiv.jp, Google
