Instruction-Guided Scene Text Recognition

Summary

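Multimodal models have recently shown strong performance on visual tasks, because instruction-guided training has given them the ability to understand fine-grained visual content.
However, because of the gap between natural images and text images, current methods cannot be applied directly to scene text recognition (STR).
This paper introduces a new paradigm that formulates STR as an instruction learning problem and proposes instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning.
IGTR first generates rich and diverse instruction triplets of <condition, question, answer>, which serve as guidance for nuanced understanding of text images.
It then uses an architecture with a dedicated cross-modal feature fusion module and a multi-task answer head to effectively fuse the instruction and image features needed to answer the questions.
Built on these designs, IGTR achieves accurate text recognition by comprehending character attributes.
Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins.
Furthermore, adjusting the instructions lets IGTR support a variety of recognition schemes.
These include zero-shot prediction, in which the model is trained on instructions that do not explicitly target character recognition, and the recognition of rarely appearing and morphologically similar characters, both of which were challenges for existing models.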

Abstract (original)

Multi-modal models have shown appealing performance in visual tasks recently, as instruction-guided training has evoked the ability to understand fine-grained visual content. However, current methods cannot be trivially applied to scene text recognition (STR) due to the gap between natural and text images. In this paper, we introduce a novel paradigm that formulates STR as an instruction learning problem, and propose instruction-guided scene text recognition (IGTR) to achieve effective cross-modal learning. IGTR first generates rich and diverse instruction triplets of <condition, question, answer>, serving as guidance for nuanced text image understanding. Then, we devise an architecture with dedicated cross-modal feature fusion module, and multi-task answer head to effectively fuse the required instruction and image features for answering questions. Built upon these designs, IGTR facilitates accurate text recognition by comprehending character attributes. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins. Furthermore, by adjusting the instructions, IGTR enables various recognition schemes. These include zero-shot prediction, where the model is trained based on instructions not explicitly targeting character recognition, and the recognition of rarely appearing and morphologically similar characters, which were previous challenges for existing models.
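
As a rough illustration of the <condition, question, answer> idea, the following Python sketch builds toy triplets from a ground-truth text label. Everything in it is an assumption made for illustration: the `Instruction` class, the `make_instructions` helper, the question wording, and the random choice of revealed characters are hypothetical and are not taken from the paper, whose actual instruction design is not described in the abstract.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instruction:
    """One hypothetical <condition, question, answer> triplet."""
    condition: str
    question: str
    answer: str


def make_instructions(label: str, n_known: int, rng: random.Random) -> List[Instruction]:
    """Build toy instruction triplets for a single text label (illustrative only)."""
    positions = list(range(len(label)))
    # Reveal a random subset of characters; these form the shared condition.
    known = sorted(rng.sample(positions, min(n_known, len(label))))
    condition = ", ".join(
        f"character {i + 1} is '{label[i]}'" for i in known
    ) or "no characters given"

    triplets = []
    # Ask about every character that was not revealed in the condition.
    for i in positions:
        if i not in known:
            triplets.append(
                Instruction(condition, f"What is character {i + 1}?", label[i])
            )
    # Add one attribute-style question that does not ask for a character identity.
    triplets.append(
        Instruction(condition, "How many characters does the text contain?", str(len(label)))
    )
    return triplets


if __name__ == "__main__":
    for triplet in make_instructions("hello", n_known=2, rng=random.Random(0)):
        print(triplet)
```

For a label such as "hello" with two characters revealed, this produces three character questions and one length question. The instructions actually used by IGTR may differ substantially from this toy scheme.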

arXiv information

Authors	Yongkun Du, Zhineng Chen, Yuchen Su, Caiyan Jia, Yu-Gang Jiang
Published	2024-01-31 14:13:01+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
