OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

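Abstract

Document content extraction is an important task in computer vision, particularly for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies.
However, current document parsing methods have significant limitations in terms of diversity and comprehensive evaluation.
To address these challenges, the authors introduce OmniDocBench, a new multi-source benchmark designed to advance automated document content extraction.
OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset covering nine diverse document types, including academic papers, textbooks, and slides.
The benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessment across the entire dataset, individual modules, or specific data types.
Using OmniDocBench, the authors carry out an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation.
OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering important insights for future work and fostering the development of document parsing technologies.
The code and dataset are available at https://github.com/opendatalab/OmniDocBench.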

Abstract (original)

Document content extraction is crucial in computer vision, especially for meeting the high-quality data needs of large language models (LLMs) and retrieval-augmented generation (RAG) technologies. However, current document parsing methods suffer from significant limitations in terms of diversity and comprehensive evaluation. To address these challenges, we introduce OmniDocBench, a novel multi-source benchmark designed to advance automated document content extraction. OmniDocBench includes a meticulously curated and annotated high-quality evaluation dataset comprising nine diverse document types, such as academic papers, textbooks, slides, among others. Our benchmark provides a flexible and comprehensive evaluation framework with 19 layout category labels and 14 attribute labels, enabling multi-level assessments across entire datasets, individual modules, or specific data types. Using OmniDocBench, we perform an exhaustive comparative analysis of existing modular pipelines and multimodal end-to-end methods, highlighting their limitations in handling document diversity and ensuring fair evaluation. OmniDocBench establishes a robust, diverse, and fair evaluation standard for the document content extraction field, offering crucial insights for future advancements and fostering the development of document parsing technologies. The codes and dataset is available in https://github.com/opendatalab/OmniDocBench.

arXiv information

Authors	Linke Ouyang, Yuan Qu, Hongbin Zhou, Jiawei Zhu, Rui Zhang, Qunshu Lin, Bin Wang, Zhiyuan Zhao, Man Jiang, Xiaomeng Zhao, Jin Shi, Fan Wu, Pei Chu, Minghao Liu, Zhenxiang Li, Chao Xu, Bo Zhang, Botian Shi, Zhongying Tu, Conghui He
Published	2024-12-10 16:05:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
