Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding

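Summary

Multimodal pre-trained document models have proven highly effective across a range of visually-rich document understanding (VrDU) tasks.
Existing pre-trained document models achieve excellent performance on standard VrDU benchmarks, but the way they model and exploit the interactions between vision and language in documents limits their generalization ability and accuracy.
This work investigates the problem of joint vision-language representation learning for VrDU, mainly from the perspective of supervisory signals.
Specifically, it proposes a pre-training paradigm called Bi-VLDoc, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and exploit the interactions between the two modalities and to learn stronger cross-modal document representations with richer semantics.
Benefiting from these informative cross-modal document representations, Bi-VLDoc significantly advances the state of the art on three widely used document understanding benchmarks: form understanding (from 85.14% to 93.44%), receipt information extraction (from 96.01% to 97.84%), and document classification (from 96.08% to 97.12%).
On Document Visual QA, Bi-VLDoc achieves state-of-the-art performance compared with previous single-model methods.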

Abstract (original)

Multi-modal document pre-trained models have proven to be very effective in a variety of visually-rich document understanding (VrDU) tasks. Though existing document pre-trained models have achieved excellent performance on standard benchmarks for VrDU, the way they model and exploit the interactions between vision and language on documents has hindered them from better generalization ability and higher accuracy. In this work, we investigate the problem of vision-language joint representation learning for VrDU mainly from the perspective of supervisory signals. Specifically, a pre-training paradigm called Bi-VLDoc is proposed, in which a bidirectional vision-language supervision strategy and a vision-language hybrid-attention mechanism are devised to fully explore and utilize the interactions between these two modalities, to learn stronger cross-modal document representations with richer semantics. Benefiting from the learned informative cross-modal document representations, Bi-VLDoc significantly advances the state-of-the-art performance on three widely-used document understanding benchmarks, including Form Understanding (from 85.14% to 93.44%), Receipt Information Extraction (from 96.01% to 97.84%), and Document Classification (from 96.08% to 97.12%). On Document Visual QA, Bi-VLDoc achieves the state-of-the-art performance compared to previous single model methods.

arXiv information

Authors	Chuwei Luo, Guozhi Tang, Qi Zheng, Cong Yao, Lianwen Jin, Chenliang Li, Yang Xue, Luo Si
Published	2022-06-27 09:58:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
