Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation

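Summary

The Segment Anything Model (SAM), a vision foundation model pre-trained on a large-scale dataset, has pushed past the limits of general-purpose segmentation and inspired a range of downstream applications.
This paper introduces Hi-SAM, a unified model that leverages SAM for hierarchical text segmentation.
Hi-SAM segments text at four levels of hierarchy (stroke, word, text line, and paragraph) and also performs layout analysis.
Specifically, the authors first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach.
They use this TSS model to iteratively generate text stroke labels in a semi-automatic manner, unifying the labels across the four text hierarchies in the HierText dataset.
With these complete labels, they then train Hi-SAM end to end; it builds on the TSS architecture and adds a customized hierarchical mask decoder.
During inference, Hi-SAM offers both an automatic mask generation (AMG) mode and a promptable segmentation mode.
In AMG mode, Hi-SAM first segments the text stroke foreground mask, then samples foreground points to generate hierarchical text masks, obtaining layout analysis along the way.
In promptable mode, Hi-SAM returns word, text-line, and paragraph masks from a single point click.
Experimental results show state-of-the-art performance for the TSS model on text stroke segmentation: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg.
Moreover, compared with the previous specialist model for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements of 4.73% PQ and 5.39% F1 at the text-line level, and 5.49% PQ and 7.39% F1 for paragraph-level layout analysis, while requiring 20x fewer training epochs.
The code is available at https://github.com/ymy-k/Hi-SAM.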
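
The AMG mode described above starts from a stroke foreground mask and samples foreground points to use as prompts. The following is a minimal sketch of that sampling step only, not the authors' implementation: the function name `sample_foreground_points`, the uniform random sampling strategy, and the toy mask are all assumptions made for illustration.

```python
import random
from typing import List, Tuple

Mask = List[List[int]]  # binary mask: 1 = text stroke foreground, 0 = background


def sample_foreground_points(
    mask: Mask, num_points: int, seed: int = 0
) -> List[Tuple[int, int]]:
    """Return up to num_points (x, y) coordinates drawn from foreground pixels.

    Uniform random sampling is an assumption; the source does not state
    how points are chosen.
    """
    foreground = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if len(foreground) <= num_points:
        return foreground
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return rng.sample(foreground, num_points)


# Toy stroke mask standing in for the first-stage output (hypothetical data).
stroke_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
]

points = sample_foreground_points(stroke_mask, num_points=3)
# Each sampled point would then serve as a point prompt for the
# word / text-line / paragraph mask decoder, which is not modeled here.
```

Every returned point lies on a stroke pixel, so each prompt is guaranteed to fall on text rather than background.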

Abstract (original)

The Segment Anything Model (SAM), a profound vision foundation model pre-trained on a large-scale dataset, breaks the boundaries of general segmentation and sparks various downstream applications. This paper introduces Hi-SAM, a unified model leveraging SAM for hierarchical text segmentation. Hi-SAM excels in text segmentation across four hierarchies, including stroke, word, text-line, and paragraph, while realizing layout analysis as well. Specifically, we first turn SAM into a high-quality text stroke segmentation (TSS) model through a parameter-efficient fine-tuning approach. We use this TSS model to iteratively generate the text stroke labels in a semi-automatical manner, unifying labels across the four text hierarchies in the HierText dataset. Subsequently, with these complete labels, we launch the end-to-end trainable Hi-SAM based on the TSS architecture with a customized hierarchical mask decoder. During inference, Hi-SAM offers both automatic mask generation (AMG) mode and promptable segmentation mode. In terms of the AMG mode, Hi-SAM segments text stroke foreground masks initially, then samples foreground points for hierarchical text mask generation and achieves layout analysis in passing. As for the promptable mode, Hi-SAM provides word, text-line, and paragraph masks with a single point click. Experimental results show the state-of-the-art performance of our TSS model: 84.86% fgIOU on Total-Text and 88.96% fgIOU on TextSeg for text stroke segmentation. Moreover, compared to the previous specialist for joint hierarchical detection and layout analysis on HierText, Hi-SAM achieves significant improvements: 4.73% PQ and 5.39% F1 on the text-line level, 5.49% PQ and 7.39% F1 on the paragraph level layout analysis, requiring 20x fewer training epochs. The code is available at https://github.com/ymy-k/Hi-SAM.
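
The stroke segmentation results are reported as fgIOU. As a hedged illustration, the sketch below assumes fgIOU means intersection-over-union computed on foreground pixels of a binary mask. The function name `foreground_iou` and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's evaluation protocol (for example, how pixels are accumulated across a dataset) may differ.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def foreground_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union restricted to foreground pixels."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


# Hypothetical 2x3 masks: 2 overlapping pixels, 4 pixels in the union.
pred = [[1, 1, 0], [0, 1, 0]]
target = [[1, 0, 0], [0, 1, 1]]
score = foreground_iou(pred, target)
```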

arXiv information

Authors	Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, Dacheng Tao
Published	2024-01-31 15:10:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
