Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition

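Abstract

Vision transformers have been highly successful at improving performance on image-based tasks, but little work has been reported on applying transformers to multilingual scene text recognition, because the visual appearance of multilingual text is complex.
To fill this gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER).
TANGER consists of a primary transformer that uses single-patch embeddings of the visual image and a supplementary transformer that uses adaptive n-grams embeddings; the latter aims to flexibly explore potential correlations between neighbouring visual patches, which is essential for extracting features from multilingual scene text.
Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring.
Extensive comparative studies are conducted on four widely used benchmark datasets and on a new multilingual scene text dataset, collected from tourism scenes in Indonesia, that contains Indonesian, English, and Chinese.
Our experimental results show that TANGER considerably outperforms the state of the art, especially in handling complex multilingual scene text.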

Abstract (original)

While vision transformers have been highly successful in improving the performance in image-based tasks, not much work has been reported on applying transformers to multilingual scene text recognition due to the complexities in the visual appearance of multilingual texts. To fill the gap, this paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER). TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings that aims to flexibly explore the potential correlations between neighbouring visual patches, which is essential for feature extraction from multilingual scene texts. Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring. Extensive comparative studies are conducted on four widely used benchmark datasets as well as a new multilingual scene text dataset containing Indonesian, English, and Chinese collected from tourism scenes in Indonesia. Our experimental results demonstrate that TANGER is considerably better compared to the state-of-the-art, especially in handling complex multilingual scene texts.
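The abstract does not say how the adaptive n-grams embeddings are computed, so the following is only a rough sketch of one way neighbouring patch embeddings could be grouped and mixed. Everything here is an assumption made for illustration, not the TANGER implementation: the function names, averaging as the way to combine a window of patches, mean pooling, and softmax-weighted mixing over n-gram sizes are all hypothetical.

```python
import math


def ngram_embeddings(patches, n):
    """Average each window of n neighbouring patch vectors (stride 1).

    Assumption: averaging stands in for whatever combination the paper uses.
    """
    if n < 1 or n > len(patches):
        raise ValueError("n must be between 1 and the number of patches")
    dim = len(patches[0])
    out = []
    for start in range(len(patches) - n + 1):
        window = patches[start:start + n]
        out.append([sum(vec[d] for vec in window) / n for d in range(dim)])
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_ngram_embedding(patches, scores):
    """Mix pooled 1-gram..k-gram summaries using softmax weights.

    scores[i] is a hypothetical preference for (i + 1)-grams; in a trained
    model such values would be learned rather than supplied by hand.
    """
    weights = softmax(scores)
    dim = len(patches[0])
    mixed = [0.0] * dim
    for i, weight in enumerate(weights):
        grams = ngram_embeddings(patches, i + 1)
        # Mean-pool so that every n-gram size yields a vector of the same shape.
        pooled = [sum(g[d] for g in grams) / len(grams) for d in range(dim)]
        mixed = [m + weight * p for m, p in zip(mixed, pooled)]
    return mixed


if __name__ == "__main__":
    # Three toy 2-dimensional patch embeddings.
    toy_patches = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
    print(ngram_embeddings(toy_patches, 2))
    print(adaptive_ngram_embedding(toy_patches, [0.0, 0.0, 0.0]))
```

In this sketch, setting n to 1 reproduces the single-patch case, while larger n lets each embedding summarise a run of adjacent patches, for example the strokes of a character that spans several patches.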

arXiv information

Authors	Xueming Yan, Zhihang Fang, Yaochu Jin
Published	2023-02-28 02:37:30+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
