Bridging Synthetic and Real Worlds for Pre-training Scene Text Detectors

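Abstract

Existing scene text detection methods typically rely on large amounts of real data for training.
Because annotated real images are scarce, recent work has tried pre-training text detectors on large-scale labeled synthetic data (LSD).
However, this introduces a synth-to-real domain gap that further limits detector performance.
In contrast, this work proposes FreeReal, a real-domain-aligned pre-training paradigm that combines the complementary strengths of LSD and unlabeled real data (URD).
Specifically, to bridge the real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored to text images. GlyphMix delineates the character structures in synthetic images and embeds them as graffiti-like units onto real images.
Without introducing real-domain drift, GlyphMix freely yields real-world images whose annotations are derived from the synthetic labels.
Furthermore, given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap between English-dominated LSD and URD in various languages.
Without bells and whistles, FreeReal improves FCENet, PSENet, PANet, and DBNet by average gains of 1.97%, 3.90%, 3.85%, and 4.56%, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets.
Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.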

Abstract (original)

Existing scene text detection methods typically rely on extensive real data for training. Due to the lack of annotated real images, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, a synth-to-real domain gap emerges, further limiting the performance of text detectors. Differently, in this work, we propose FreeReal, a real-domain-aligned pre-training paradigm that enables the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge real and synthetic worlds for pre-training, a glyph-based mixing mechanism (GlyphMix) is tailored for text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, when given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap stemming from English-dominated LSD to URD in various languages. Without bells and whistles, FreeReal achieves average gains of 1.97%, 3.90%, 3.85%, and 4.56% in improving the performance of FCENet, PSENet, PANet, and DBNet methods, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.
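
The mixing step described in the abstract can be pictured with a toy sketch. The abstract does not specify how GlyphMix extracts glyphs or blends them, so everything here is an assumption made for illustration: images are small grayscale grids stored as nested lists, the glyph mask is taken as given, and the paste is a hard per-pixel copy. The name `glyph_mix` and its parameters are hypothetical and are not taken from the FreeReal code base.

```python
from typing import List

# Toy representations (assumptions): a grayscale image is a grid of ints,
# and a glyph mask marks the pixels that belong to character strokes.
Image = List[List[int]]
Mask = List[List[bool]]


def glyph_mix(synth: Image, glyph_mask: Mask, real: Image) -> Image:
    """Copy glyph pixels from `synth` onto `real`; keep `real` elsewhere.

    This is a hard-mask paste, a simplification and not necessarily the
    blending that GlyphMix itself uses.
    """
    if not (len(synth) == len(glyph_mask) == len(real)):
        raise ValueError("inputs must have the same height")
    mixed: Image = []
    for s_row, m_row, r_row in zip(synth, glyph_mask, real):
        if not (len(s_row) == len(m_row) == len(r_row)):
            raise ValueError("inputs must have the same width")
        # Take the synthetic pixel on glyph strokes, the real pixel otherwise.
        mixed.append([s if m else r for s, m, r in zip(s_row, m_row, r_row)])
    return mixed


# A 3x4 example: a uniform synthetic image, a real image with varying
# background, and a mask covering three glyph pixels.
synth = [[200] * 4 for _ in range(3)]
real = [[50, 60, 70, 80] for _ in range(3)]
mask = [
    [False, True, True, False],
    [False, True, False, False],
    [False, False, False, False],
]
mixed = glyph_mix(synth, mask, real)
```

Because the glyph pixels keep their original coordinates in this sketch, the synthetic text annotations (for example, boxes or polygons) would apply to the mixed image unchanged, which is one way to read the abstract's claim that annotations are derived from synthetic labels.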

arXiv information

Authors	Tongkun Guan,Wei Shen,Xue Yang,Xuehui Wang,Xiaokang Yang
Published	2024-07-10 15:49:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
