Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding

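Abstract

Visual document understanding has become essential as text-rich visual content grows.
The field is challenging because it requires effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts.
Moreover, existing fine-tuning datasets for this domain often lack the detailed contextual information needed for robust understanding, which leads to hallucinations and a limited grasp of the spatial relationships among visual elements.
To address these challenges, the authors propose a pipeline that adaptively generates markup languages such as Markdown, JSON, HTML, and TikZ to build highly structured document representations and deliver contextually grounded responses.
They introduce two fine-grained structured datasets: DocMark-Pile, with approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, with 624k fine-tuning data annotations for grounded instruction following.
Extensive experiments show that the proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks and supports advanced reasoning and comprehension in complex visual scenarios.
The code and models are released at https://github.com/Euphoria16/DocMark.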

Abstract (original)

Visual Document Understanding has become essential with the increase of text-rich visual content. This field poses significant challenges due to the need for effective integration of visual perception and textual comprehension, particularly across diverse document types with complex layouts. Moreover, existing fine-tuning datasets for this domain often fall short in providing the detailed contextual information for robust understanding, leading to hallucinations and limited comprehension of spatial relationships among visual elements. To address these challenges, we propose an innovative pipeline that utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML, and TikZ, to build highly structured document representations and deliver contextually-grounded responses. We introduce two fine-grained structured datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data annotations for grounded instruction following. Extensive experiments demonstrate that our proposed model significantly outperforms existing state-of-the-art MLLMs across a range of visual document understanding benchmarks, facilitating advanced reasoning and comprehension capabilities in complex visual scenarios. Our code and models are released at https://github.com/Euphoria16/DocMark.
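
As a rough illustration of what "adaptive generation of markup languages" could look like in practice, the sketch below picks a markup language per document type and phrases a two-step request: first transcribe the image into that markup, then answer using the transcription as context. This is only a sketch under stated assumptions, not the authors' pipeline. The abstract names the four markup languages but does not say which document types map to which, so the mapping, the `GroundedPrompt` type, the `build_grounded_prompt` helper, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from document type to markup language. The abstract
# lists Markdown, JSON, HTML, and TikZ but does not specify this assignment.
MARKUP_BY_DOC_TYPE = {
    "plain_text": "Markdown",
    "chart": "JSON",
    "table": "HTML",
    "webpage": "HTML",
    "diagram": "TikZ",
}


@dataclass
class GroundedPrompt:
    """Two-step request: parse the document into markup, then answer from it."""
    markup_language: str
    parse_instruction: str
    answer_instruction: str


def build_grounded_prompt(doc_type: str, question: str) -> GroundedPrompt:
    """Choose a markup language for the document type and build both prompts."""
    # Falling back to Markdown for unknown types is an assumption of this sketch.
    markup = MARKUP_BY_DOC_TYPE.get(doc_type, "Markdown")
    return GroundedPrompt(
        markup_language=markup,
        parse_instruction=f"Transcribe the document image as {markup}.",
        answer_instruction=(
            f"Using the {markup} transcription as context, answer: {question}"
        ),
    )


prompt = build_grounded_prompt("chart", "Which year has the highest value?")
print(prompt.parse_instruction)
print(prompt.answer_instruction)
```

The point of the design is that the intermediate markup acts as explicit context for the final answer, which is the sense in which the abstract calls the responses contextually grounded.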

arXiv information

Authors	Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Published	2025-05-08 17:37:36+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
