Towards Deployable OCR models for Indic languages

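Summary

Recognizing text in word or line images without sub-word segmentation has become the mainstream approach in research and development of text recognition for Indian languages.
Modelling unsegmented sequences with Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR.
In this work, we present a comprehensive empirical study of various neural network models that use CTC to transcribe the step-wise predictions in the network output into a Unicode sequence.
The study covers 13 Indian languages, using an internal dataset of around 1,000 pages per language.
We study the choice of line versus word as the recognition unit, and the use of synthetic data to train the models.
We compare our models with popular publicly available OCR tools for end-to-end document image recognition.
Our end-to-end pipeline, which combines our recognition models with existing text segmentation tools, outperforms these public OCR tools for 8 of the 13 languages.
We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages.
The dataset contains more than 1.2 million annotated word images (120 thousand text lines) across 13 Indian languages.
Our code, trained models, and the Mozhi dataset will be made available at http://cvit.iiit.ac.in/research/projects/cvit-projects/.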
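As an illustration of the CTC transcription step the abstract refers to (turning step-wise network predictions into a Unicode sequence), here is a minimal sketch of greedy (best-path) CTC decoding. It is not the authors' implementation: the function name `ctc_greedy_decode`, the blank index `BLANK = 0`, and the toy three-character alphabet are assumptions made for this example only.

```python
from itertools import groupby

BLANK = 0  # assumption: label 0 is the CTC blank symbol


def ctc_greedy_decode(step_scores, alphabet):
    """Greedy CTC decoding (hypothetical helper, for illustration).

    step_scores: one list of per-label scores for each time step.
    alphabet:    alphabet[i] is the Unicode string for label i
                 (alphabet[BLANK] is never emitted).
    """
    # 1. Pick the highest-scoring label at every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in step_scores
    ]
    # 2. Collapse consecutive repeats of the same label.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining labels to Unicode.
    return "".join(alphabet[label] for label in collapsed if label != BLANK)


if __name__ == "__main__":
    # Toy alphabet: blank plus three Devanagari characters (illustrative).
    alphabet = ["", "क", "म", "ल"]

    def one_hot(index, size=4):
        return [1.0 if i == index else 0.0 for i in range(size)]

    # Best path 1,1,0,2,0,3,3 collapses to 1,2,3.
    scores = [one_hot(i) for i in (1, 1, 0, 2, 0, 3, 3)]
    print(ctc_greedy_decode(scores, alphabet))  # prints "कमल"
```

A blank between two identical labels keeps both of them, which is how CTC represents doubled characters. Greedy decoding is the simplest option; beam search over the same scores is a common alternative.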

Summary (original)

Recognition of text on word or line images, without the need for sub-word segmentation, has become the mainstream of research and development of text recognition for Indian languages. Modelling unsegmented sequences using Connectionist Temporal Classification (CTC) is the most commonly used approach for segmentation-free OCR. In this work we present a comprehensive empirical study of various neural network models that use CTC for transcribing step-wise predictions in the neural network output to a Unicode sequence. The study is conducted for 13 Indian languages, using an internal dataset that has around 1000 pages per language. We study the choice of line vs word as the recognition unit, and the use of synthetic data to train the models. We compare our models with popular publicly available OCR tools for end-to-end document image recognition. Our end-to-end pipeline, which employs our recognition models and existing text segmentation tools, outperforms these public OCR tools for 8 out of the 13 languages. We also introduce a new public dataset called Mozhi for word and line recognition in Indian languages. The dataset contains more than 1.2 million annotated word images (120 thousand text lines) across 13 Indian languages. Our code, trained models and the Mozhi dataset will be made available at http://cvit.iiit.ac.in/research/projects/cvit-projects/

arXiv information

Authors	Minesh Mathew, Ajoy Mondal, CV Jawahar
Published	2024-12-18 14:41:53+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
