LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating

Summary

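Large vision-language models (LVLMs) have markedly improved document understanding, enabling them to handle complex document elements, longer contexts, and a wider range of tasks.
However, existing document understanding benchmarks are limited to only a small number of pages and do not provide a comprehensive analysis of how well models locate layout elements.
In this paper, we first define three primary task categories (long document understanding, numerical reasoning, and cross-element locating), and then propose LongDocURL, a comprehensive benchmark that integrates these three primary tasks and comprises 20 sub-tasks categorized by primary task and answer evidence.
Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answer pairs covering more than 33,000 pages of documents, far exceeding existing benchmarks in scale.
We then conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.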

Summary (original)

Large vision language models (LVLMs) have improved the document understanding capabilities remarkably, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout elements locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidences. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly outperforming existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.

arXiv information

Authors	Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
Published	2024-12-24 13:39:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
