Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

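Summary

Title: Linguistic More: Taking a Further Step toward Efficient and Accurate Scene Text Recognition

Summary:
– Vision models have attracted attention in Scene Text Recognition (STR) tasks for their simplicity and efficiency.
– However, because recent vision models lack the perception of linguistic knowledge and information, they suffer from two problems:
– A purely vision-based query causes attention drift, which usually lowers recognition accuracy; the paper summarizes this as the "linguistic insensitive drift (LID)" problem.
– The visual features are suboptimal for recognition in some vision-missing cases (e.g. occlusion).
– To address these problems, the authors propose the Linguistic Perception Vision model (LPV), which explores the linguistic capability of vision models to achieve accurate text recognition.
– To alleviate the LID problem, they introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality, accurate attention maps through step-wise optimization and linguistic information mining.
– They further propose a Global Linguistic Reconstruction Module (GLRM), which improves the representation of visual features by perceiving linguistic information in the visual space and gradually converts them into semantically rich representations.
– Unlike previous methods, the approach obtains SOTA results while keeping complexity low. The code is available at https://github.com/CyrilSterling/LPV.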
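
The cascade idea behind CPA can be illustrated with a small NumPy sketch. This is not the LPV implementation (the authors' code is in the repository linked above); it only shows the general pattern of position queries attending over visual features, with the queries refined stage by stage. The function names, the residual query update, and the number of stages are all assumptions made for illustration, and the linguistic information mining step is not modeled.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(queries, features):
    # queries:  (T, D) one query per character position
    # features: (N, D) flattened visual feature map
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)  # (T, N)
    attn = softmax(scores, axis=-1)             # one attention map per position
    glimpses = attn @ features                  # (T, D) attended features
    return attn, glimpses

def cascade_position_attention(pos_queries, features, num_stages=3):
    # Hypothetical cascade: each stage refines the queries with the
    # previous stage's glimpses (assumed residual update, not from the paper).
    queries = pos_queries
    attn_maps = []
    for _ in range(num_stages):
        attn, glimpses = position_attention(queries, features)
        attn_maps.append(attn)
        queries = pos_queries + glimpses
    return attn_maps, glimpses

rng = np.random.default_rng(0)
pos_queries = rng.normal(size=(5, 8))   # 5 character positions, dim 8
features = rng.normal(size=(12, 8))     # 12 spatial locations, dim 8
attn_maps, glimpses = cascade_position_attention(pos_queries, features)
```

Each entry of `attn_maps` is the attention map from one stage, so comparing them shows how the attention shifts as the queries are refined.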

Abstract (original)

Vision models have gained increasing attention due to their simplicity and efficiency in the Scene Text Recognition (STR) task. However, because they lack the perception of linguistic knowledge and information, recent vision models suffer from two problems: (1) the pure vision-based query results in attention drift, which usually causes poor recognition and is summarized as the linguistic insensitive drift (LID) problem in this paper; (2) the visual feature is suboptimal for recognition in some vision-missing cases (e.g. occlusion). To address these issues, we propose a Linguistic Perception Vision model (LPV), which explores the linguistic capability of vision models for accurate text recognition. To alleviate the LID problem, we introduce a Cascade Position Attention (CPA) mechanism that obtains high-quality and accurate attention maps through step-wise optimization and linguistic information mining. Furthermore, a Global Linguistic Reconstruction Module (GLRM) is proposed to improve the representation of visual features by perceiving the linguistic information in the visual space, which gradually converts visual features into semantically rich ones during the cascade process. Different from previous methods, our method obtains SOTA results while keeping low complexity (92.4% accuracy with only 8.11M parameters). Code is available at https://github.com/CyrilSterling/LPV.

arXiv information

Authors	Boqiang Zhang, Hongtao Xie, Yuxin Wang, Jianjun Xu, Yongdong Zhang
Published	2023-05-10 12:55:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
