GlobalDoc: A Cross-Modal Vision-Language Framework for Real-World Document Image Retrieval and Classification

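Summary

Visual document understanding (VDU) has advanced rapidly with the development of powerful multimodal language models.
However, these models typically require large-scale document pre-training data to learn intermediate representations, and their performance often drops significantly in real-world online industrial environments.
The main problem is their heavy reliance on OCR engines to extract local positional information within document pages. This limits the models' ability to capture global information and hinders their generalizability, flexibility, and robustness.
This paper introduces GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner using three novel pretext objective tasks.
By unifying language and visual representations, GlobalDoc improves the learning of richer semantic concepts and yields more transferable models.
For proper evaluation, the authors also propose two new document-level downstream VDU tasks designed to simulate industrial scenarios more closely: Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR).
Extensive experiments are conducted to demonstrate the effectiveness of GlobalDoc in practical settings.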

Summary (original)

Visual document understanding (VDU) has rapidly advanced with the development of powerful multi-modal language models. However, these models typically require extensive document pre-training data to learn intermediate representations and often suffer a significant performance drop in real-world online industrial settings. A primary issue is their heavy reliance on OCR engines to extract local positional information within document pages, which limits the models’ ability to capture global information and hinders their generalizability, flexibility, and robustness. In this paper, we introduce GlobalDoc, a cross-modal transformer-based architecture pre-trained in a self-supervised manner using three novel pretext objective tasks. GlobalDoc improves the learning of richer semantic concepts by unifying language and visual representations, resulting in more transferable models. For proper evaluation, we also propose two novel document-level downstream VDU tasks, Few-Shot Document Image Classification (DIC) and Content-based Document Image Retrieval (DIR), designed to simulate industrial scenarios more closely. Extensive experimentation has been conducted to demonstrate GlobalDoc’s effectiveness in practical settings.

arXiv information

Authors	Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickaël Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós
Published	2024-11-05 15:18:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
