Hierarchical Multimodal Pre-training for Visually Rich Webpage Understanding

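Abstract

The growing prevalence of visually rich documents, such as webpages and scanned or digital-born documents (images, PDFs, and so on), has increased interest in automatic document understanding and information extraction across academia and industry.
Various document modalities, including image, text, layout, and structure, make it easier for humans to retrieve information, but the interconnected nature of these modalities poses a challenge for neural networks.
This paper introduces WebLM, a multimodal pre-training network designed to address the limitations of modeling only the text and structure modalities of HTML in webpages.
Rather than processing document images as unified natural images, WebLM integrates the hierarchical structure of document images to enhance the understanding of markup-language-based documents.
In addition, the paper proposes several pre-training tasks to effectively model the interactions among the text, structure, and image modalities.
Empirical results show that the pre-trained WebLM significantly outperforms previous state-of-the-art pre-trained models across several webpage understanding tasks.
The pre-trained models and code are available at https://github.com/X-LANCE/weblm.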

Abstract (original)

The growing prevalence of visually rich documents, such as webpages and scanned/digital-born documents (images, PDFs, etc.), has led to increased interest in automatic document understanding and information extraction across academia and industry. Although various document modalities, including image, text, layout, and structure, facilitate human information retrieval, the interconnected nature of these modalities presents challenges for neural networks. In this paper, we introduce WebLM, a multimodal pre-training network designed to address the limitations of solely modeling text and structure modalities of HTML in webpages. Instead of processing document images as unified natural images, WebLM integrates the hierarchical structure of document images to enhance the understanding of markup-language-based documents. Additionally, we propose several pre-training tasks to model the interaction among text, structure, and image modalities effectively. Empirical results demonstrate that the pre-trained WebLM significantly surpasses previous state-of-the-art pre-trained models across several webpage understanding tasks. The pre-trained models and code are available at https://github.com/X-LANCE/weblm.

arXiv information

Authors	Hongshen Xu, Lu Chen, Zihan Zhao, Da Ma, Ruisheng Cao, Zichen Zhu, Kai Yu
Published	2024-02-28 11:50:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
