TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

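Summary

We present TextMonkey, a large multimodal model (LMM) tailored to text-centric tasks such as document question answering (DocVQA) and scene text analysis.
Our approach introduces improvements along several dimensions. Adopting Shifted Window Attention with zero-initialization provides cross-window connectivity at higher input resolutions and stabilizes early training.
We hypothesize that images may contain redundant tokens; using similarity to retain only the significant tokens not only shortens the token sequence but also improves model performance.
We also extend the model to text spotting and grounding and incorporate positional information into its responses, which improves interpretability and reduces hallucinations.
In addition, TextMonkey can be fine-tuned to understand commands for clicking on screenshots.
Overall, the method notably improves performance across a range of benchmark datasets, with gains of 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and a score of 561 on OCRBench, surpassing prior open-source large multimodal models for document understanding.
Code will be released at https://github.com/Yuliang-Liu/Monkey.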
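
The abstract does not spell out how similarity is used to filter tokens, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that a token counts as redundant when it has a near-duplicate elsewhere in the sequence, scores each token by its highest cosine similarity to any other token, and keeps the tokens with the lowest scores. The names `cosine` and `select_distinct_tokens` and the `keep` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_distinct_tokens(tokens, keep):
    """Return the indices of the `keep` least redundant tokens, in original order.

    Assumption for illustration: a token's redundancy is its maximum cosine
    similarity to any *other* token in the sequence.
    """
    n = len(tokens)
    if not 0 <= keep <= n:
        raise ValueError("keep must be between 0 and the number of tokens")

    redundancy = []
    for i in range(n):
        # A lone token has nothing to be redundant with, so it gets the minimum score.
        max_sim = max(
            (cosine(tokens[i], tokens[j]) for j in range(n) if j != i),
            default=-1.0,
        )
        redundancy.append(max_sim)

    # Lowest redundancy first; sorted() is stable, so ties keep index order.
    order = sorted(range(n), key=lambda i: redundancy[i])
    return sorted(order[:keep])
```

For example, with the four toy tokens `[[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]` and `keep=2`, the two identical tokens score 1.0 and are dropped, so the function returns `[2, 3]`. The pairwise comparison is quadratic in the number of tokens, which is fine for a sketch but would need batching on real image-token sequences.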

Abstract (original)

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancement across several dimensions: by adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; We hypothesize that images may contain redundant tokens, and by using similarity to filter out significant tokens, we can not only streamline the token length but also enhance the model’s performance. Moreover, by expanding our model’s capabilities to encompass text spotting and grounding, and incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be finetuned to gain the ability to comprehend commands for clicking screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document Oriented VQA, and KIE, respectively, especially with a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at https://github.com/Yuliang-Liu/Monkey.

arXiv information

Authors	Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, Xiang Bai
Published	2024-03-07 13:16:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
