Visual Text Processing: A Comprehensive Review and Unified Evaluation

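Abstract

Visual text is a crucial component of both document images and scene images, conveying rich semantic information and attracting significant attention in the computer vision community.
Beyond traditional tasks such as text detection and recognition, visual text processing, which includes text image reconstruction and text image manipulation, has advanced rapidly with the emergence of foundation models.
Despite this progress, challenges remain because of the unique properties that distinguish text from general objects.
Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models.
This survey presents a comprehensive, multi-perspective analysis of recent advances in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks?
(2) How can these distinctive text features be effectively incorporated into processing frameworks?
In addition, we introduce VTPBench, a new benchmark that covers a broad range of visual text processing datasets.
Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a new evaluation metric designed to ensure fair and reliable evaluation.
Our empirical study with more than 20 specific models reveals substantial room for improvement in current techniques.
Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.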

Abstract (original)

Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.

arXiv information

Authors	Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe
Published	2025-04-30 14:19:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
