End-to-End Video Text Spotting with Transformer

Summary

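Recent video text spotting methods typically require a three-stage pipeline: detecting text in individual frames, recognizing the localized text, and tracking text streams with post-processing to produce the final results.
These methods usually follow the tracking-by-match paradigm and rely on sophisticated pipelines.
This paper proposes TransDETR, a simple but effective end-to-end framework for video text detection, tracking, and recognition, built on Transformer sequence modeling.
TransDETR has two main advantages. 1) Unlike approaches that explicitly match text between adjacent frames, TransDETR implicitly tracks and recognizes each text instance with its own query, termed a text query, over a long-range temporal sequence (more than 7 frames).
2) TransDETR is the first end-to-end trainable video text spotting framework, and it addresses all three sub-tasks (text detection, tracking, and recognition) simultaneously.
Extensive experiments on four video text datasets (ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) show that TransDETR achieves state-of-the-art performance, with improvements of up to around 8.0% on video text spotting tasks.
The code for TransDETR is available at https://github.com/weijiawu/TransDETR.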

Summary (original)

Recent video text spotting methods usually require the three-staged pipeline, i.e., detecting text in individual images, recognizing localized text, tracking text streams with post-processing to generate final results. These methods typically follow the tracking-by-match paradigm and develop sophisticated pipelines. In this paper, rooted in Transformer sequence modeling, we propose a simple, but effective end-to-end video text DEtection, Tracking, and Recognition framework (TransDETR). TransDETR mainly includes two advantages: 1) Different from the explicit match paradigm in the adjacent frame, TransDETR tracks and recognizes each text implicitly by the different query termed text query over long-range temporal sequence (more than 7 frames). 2) TransDETR is the first end-to-end trainable video text spotting framework, which simultaneously addresses the three sub-tasks (e.g., text detection, tracking, recognition). Extensive experiments in four video text datasets (i.e., ICDAR2013 Video, ICDAR2015 Video, Minetto, and YouTube Video Text) are conducted to demonstrate that TransDETR achieves state-of-the-art performance with up to around 8.0% improvements on video text spotting tasks. The code of TransDETR can be found at https://github.com/weijiawu/TransDETR.

arXiv information

Authors	Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Hong Zhou, Ping Luo
Published	2022-08-22 05:34:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
