A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

Abstract

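Recently, many studies have shown that feeding only OCR-derived text and spatial layout into a large language model (LLM) can be highly effective for document understanding tasks.
However, existing methods for integrating spatial layout with text have limitations, such as producing overly long text sequences or failing to fully exploit the autoregressive nature of LLMs.
In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding.
Specifically, LayTextLLM projects each bounding box to a single embedding and interleaves it with the text, which avoids the long-sequence problem while still exploiting the autoregressive nature of LLMs.
LayTextLLM not only streamlines the interaction between layout and text data but also improves performance on Key Information Extraction (KIE) and Visual Question Answering (VQA).
Comprehensive benchmark evaluations show significant improvements: a 27.2% increase on KIE tasks and 12.0% on VQA tasks over previous state-of-the-art document understanding MLLMs, and a 15.1% improvement over other SOTA OCR-based LLMs on KIE tasks.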

Abstract (original)

Recently, many studies have demonstrated that exclusively incorporating OCR-derived text and spatial layouts with large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully leverage the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects each bounding box to a single embedding and interleaves it with text, efficiently avoiding long sequence issues while leveraging autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction of layout and textual data but also shows enhanced performance in Key Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive benchmark evaluations reveal significant improvements, with a 27.2% increase on KIE tasks and 12.0% on VQA tasks compared to previous state-of-the-art document understanding MLLMs, as well as a 15.1% improvement over other SOTA OCR-based LLMs on KIE tasks.
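The core idea in the abstract, one embedding per bounding box interleaved with the text tokens of that box, can be illustrated with a small toy sketch. Everything below is an assumption made for illustration: the abstract does not describe the projector architecture, so a single linear map stands in for it, and `EMBED_DIM`, `make_linear`, `project_bbox`, `embed_token`, and `interleave` are hypothetical names, not the authors' code.

```python
import random

EMBED_DIM = 8  # hypothetical hidden size, kept tiny for illustration


def make_linear(in_dim, out_dim, seed=0):
    """Create a toy linear layer (weight matrix and bias) with fixed random values."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def project_bbox(bbox, linear):
    """Map one normalized box (x1, y1, x2, y2) to a single embedding vector."""
    weight, bias = linear
    return [sum(w * x for w, x in zip(row, bbox)) + b for row, b in zip(weight, bias)]


def embed_token(token_id, dim=EMBED_DIM):
    """Stand-in for an LLM token-embedding lookup (deterministic toy values)."""
    rng = random.Random(token_id)
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def interleave(segments, linear):
    """Build the input sequence: one box embedding, then that segment's text tokens."""
    sequence = []
    for bbox, token_ids in segments:
        sequence.append(("bbox", project_bbox(bbox, linear)))
        for token_id in token_ids:
            sequence.append(("text", embed_token(token_id)))
    return sequence


# Two OCR text segments, each with a normalized box and made-up token ids.
segments = [
    ((0.10, 0.05, 0.40, 0.08), [101, 102]),
    ((0.55, 0.05, 0.90, 0.08), [201, 202, 203]),
]
linear = make_linear(4, EMBED_DIM)
sequence = interleave(segments, linear)
kinds = [kind for kind, _ in sequence]
```

In this sketch each box adds exactly one position to the sequence, so two segments with five text tokens in total give seven positions. Writing the four coordinates out as text would instead cost several tokens per box, which is the long-sequence issue the abstract refers to.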

arXiv information

Authors	Jinghui Lu, Haiyang Yu, Yanjie Wang, Yongjie Ye, Jingqun Tang, Ziwei Yang, Binghong Wu, Qi Liu, Hao Feng, Han Wang, Hao Liu, Can Huang
Published	2024-07-24 11:45:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
