On the Hidden Mystery of OCR in Large Multimodal Models

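Summary

Large models have recently come to play a leading role in natural language processing and multimodal vision-language learning.
However, their effectiveness on text-related visual tasks remains relatively unexplored.
In this paper, we conduct a comprehensive evaluation of large multimodal models such as GPT4V and Gemini on a range of text-related visual tasks: text recognition, scene text-centric visual question answering (VQA), document-oriented VQA, key information extraction (KIE), and handwritten mathematical expression recognition (HMER).
To make it easier to assess optical character recognition (OCR) capabilities in large multimodal models, we propose OCRBench, a comprehensive evaluation benchmark.
OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
Our study also reveals both the strengths and the weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition.
Most importantly, the baseline results presented in this study could provide a foundation for devising and assessing new strategies aimed at improving zero-shot multimodal techniques.
The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.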

Abstract (original)

Large models have recently played a dominant role in natural language processing and multimodal vision-language learning. However, their effectiveness in text-related visual tasks remains relatively unexplored. In this paper, we conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks including Text Recognition, Scene Text-Centric Visual Question Answering (VQA), Document-Oriented VQA, Key Information Extraction (KIE), and Handwritten Mathematical Expression Recognition (HMER). To facilitate the assessment of Optical Character Recognition (OCR) capabilities in Large Multimodal Models, we propose OCRBench, a comprehensive evaluation benchmark. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available. Furthermore, our study reveals both the strengths and weaknesses of these models, particularly in handling multilingual text, handwritten text, non-semantic text, and mathematical expression recognition. Most importantly, the baseline results presented in this study could provide a foundational framework for the conception and assessment of innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline and benchmark are available at https://github.com/Yuliang-Liu/MultimodalOCR.

arXiv information

Authors	Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xucheng Yin, Cheng-lin Liu, Lianwen Jin, Xiang Bai
Published	2024-08-14 03:30:14+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
