VRDU: A Benchmark for Visually-rich Document Understanding

Abstract

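Understanding visually rich business documents in order to extract structured data and automate business workflows has been attracting attention in both academia and industry.
Recent multi-modal language models have achieved impressive results, but we find that existing benchmarks do not reflect the complexity of the real documents seen in industry.
In this work, we identify the desiderata for a more comprehensive benchmark and propose one that we call Visually Rich Document Understanding (VRDU).
VRDU contains two datasets that represent several challenges: rich schemas that include diverse data types and hierarchical entities, complex templates that include tables and multi-column layouts, and a diversity of layouts (templates) within a single document type.
To evaluate extraction results, we design few-shot and conventional experiment settings together with a carefully designed matching algorithm.
We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line-items in an invoice.
We plan to open source the benchmark and the evaluation toolkit.
We hope this helps the community make progress on the challenging task of extracting structured data from visually rich documents.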

Abstract (original)

Understanding visually-rich business documents to extract structured data and automate business workflows has been receiving attention both in academia and industry. Although recent multi-modal language models have achieved impressive results, we find that existing benchmarks do not reflect the complexity of real documents seen in industry. In this work, we identify the desiderata for a more comprehensive benchmark and propose one we call Visually Rich Document Understanding (VRDU). VRDU contains two datasets that represent several challenges: rich schema including diverse data types as well as hierarchical entities, complex templates including tables and multi-column layouts, and diversity of different layouts (templates) within a single document type. We design few-shot and conventional experiment settings along with a carefully designed matching algorithm to evaluate extraction results. We report the performance of strong baselines and offer three observations: (1) generalizing to new document templates is still very challenging, (2) few-shot performance has a lot of headroom, and (3) models struggle with hierarchical fields such as line-items in an invoice. We plan to open source the benchmark and the evaluation toolkit. We hope this helps the community make progress on these challenging tasks in extracting structured data from visually rich documents.
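The abstract mentions a matching algorithm for scoring extraction results but does not describe it. As a rough illustration only, the sketch below shows one way type-aware matching and micro-averaged F1 could be computed over extracted entities. Everything in it is an assumption for illustration: the `Entity` class, the field names, the `_amount` naming convention, and the normalization rules are hypothetical and are not taken from the VRDU evaluation toolkit.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    field: str  # schema field name (hypothetical, e.g. "total_amount")
    value: str  # text extracted from the document


def normalize(field: str, value: str) -> str:
    """Type-aware normalization; the rules here are assumed, not the paper's."""
    text = " ".join(value.split()).lower()
    if field.endswith("_amount"):
        # Compare prices numerically so "$1,200.00" matches "1200".
        digits = text.replace("$", "").replace(",", "")
        try:
            return f"{float(digits):.2f}"
        except ValueError:
            return text
    return text


def micro_f1(predicted: list[Entity], gold: list[Entity]) -> float:
    """Micro-averaged F1 over (field, normalized value) pairs."""
    pred = Counter((e.field, normalize(e.field, e.value)) for e in predicted)
    ref = Counter((e.field, normalize(e.field, e.value)) for e in gold)
    true_pos = sum((pred & ref).values())  # multiset intersection
    if true_pos == 0:
        return 0.0
    precision = true_pos / sum(pred.values())
    recall = true_pos / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


gold = [Entity("total_amount", "1200"), Entity("advertiser", "Acme Corp")]
pred = [
    Entity("total_amount", "$1,200.00"),
    Entity("advertiser", "ACME  corp"),
    Entity("agency", "Example Media"),  # spurious extraction
]
score = micro_f1(pred, gold)
```

In this toy case both gold entities are matched after normalization and one prediction is spurious, so precision is 2/3 and recall is 1. Hierarchical fields such as invoice line-items would need an additional step that aligns groups of child entities, which this sketch leaves out.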

arXiv information

Authors	Zilong Wang, Yichao Zhou, Wei Wei, Chen-Yu Lee, Sandeep Tata
Published	2023-06-20 21:34:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
