Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

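Abstract

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising progress in visual document understanding (VDU).
Existing solutions study a variety of vision-language pre-training objectives, but the document textline, an intrinsic granularity in VDU, has seldom been explored.
A document textline usually contains words that are spatially and semantically correlated, and it can be easily obtained from OCR engines.
This paper proposes Wukong-Reader, which is trained with new pre-training objectives to exploit the structural knowledge nested in document textlines.
It introduces textline-region contrastive learning to achieve fine-grained alignment between the visual regions and the texts of document textlines.
In addition, masked region modeling and textline-grid matching are designed to enhance the visual and layout representations of textlines.
Experiments show that Wukong-Reader achieves superior performance on various VDU tasks such as information extraction.
The fine-grained alignment over textlines also gives Wukong-Reader promising localization ability.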

Abstract (original)

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, which can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that our Wukong-Reader has superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.
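
To make the idea of textline-region contrastive learning more concrete, the following is a minimal sketch of a generic symmetric InfoNCE-style loss over paired region and text embeddings. It is not the paper's exact objective, which the abstract does not specify. The function names, the `temperature` value, and the toy feature vectors are all assumptions chosen for illustration, and inputs are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss (illustrative assumption, not the paper's loss).

    region_feats[i] and text_feats[i] are treated as the matching pair for
    textline i; every other combination in the batch acts as a negative.
    """
    regions = [l2_normalize(v) for v in region_feats]
    texts = [l2_normalize(v) for v in text_feats]
    n = len(regions)

    # Cosine similarity between every region and every text, scaled.
    sim = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]

    # Region-to-text direction: each region should pick its own text.
    region_to_text = -sum(log_softmax(sim[i])[i] for i in range(n)) / n

    # Text-to-region direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_region = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n

    return 0.5 * (region_to_text + text_to_region)


# Toy embeddings (hypothetical): two textlines with 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

aligned = contrastive_loss(regions, aligned_texts)
swapped = contrastive_loss(regions, swapped_texts)
print(aligned < swapped)
```

In this toy setup the loss is lower when each region embedding points in the same direction as its own textline's text embedding than when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.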

arXiv information

Authors	Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu
Published	2022-12-19 17:00:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
