Open-Vocabulary Panoptic Segmentation with MaskCLIP

Summary

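This paper tackles open-vocabulary panoptic segmentation, a new computer vision task that aims to perform panoptic segmentation (background semantic labeling plus foreground instance segmentation) for arbitrary categories given as text descriptions.
We first build a baseline method that uses the knowledge in an existing CLIP model without fine-tuning or distillation.
We then develop a new method, MaskCLIP, a Transformer-based approach that uses mask queries with a ViT-based CLIP backbone to perform semantic segmentation and object instance segmentation.
For this, we design a Relative Mask Attention (RMA) module that treats segmentations as additional tokens to the ViT CLIP model.
MaskCLIP learns to use pre-trained dense/local CLIP features efficiently and effectively, avoiding the time-consuming step of cropping image patches and computing their features with an external CLIP image model.
On the ADE20K and PASCAL datasets, we obtain encouraging results for open-vocabulary panoptic segmentation and state-of-the-art results for open-vocabulary semantic segmentation.
We also show qualitative examples of MaskCLIP with custom categories.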
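
To make the open-vocabulary idea concrete, here is a minimal toy sketch of how a mask could be labeled with an arbitrary text category using dense features instead of image crops. It is not the paper's implementation: plain mask pooling (averaging the patch features inside a mask) stands in for the RMA module, which the summary does not describe in detail. The function names, the 3-dimensional vectors, and the temperature of 100 are all assumptions for illustration; real CLIP embeddings would come from the image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mask_pool(patch_features, mask):
    """Average the features of the patches selected by a binary mask.

    Assumption: this simple pooling is a stand-in for the paper's RMA module.
    """
    selected = [f for f, keep in zip(patch_features, mask) if keep]
    if not selected:
        raise ValueError("mask selects no patches")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


def classify_mask(mask_embedding, text_embeddings, temperature=100.0):
    """Pick the category whose text embedding best matches the mask embedding.

    text_embeddings maps a category name to a vector, so new categories can
    be added at inference time without retraining (hypothetical interface).
    """
    names = list(text_embeddings)
    logits = [temperature * cosine(mask_embedding, text_embeddings[n]) for n in names]
    probs = softmax(logits)
    best = max(range(len(names)), key=probs.__getitem__)
    return names[best], probs[best]


# Toy data: four patches with 3-dimensional features (hypothetical values).
patch_features = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 1.0],
]
# Toy "text embeddings" for user-chosen categories (hypothetical values).
text_embeddings = {
    "sky": [1.0, 0.0, 0.0],
    "grass": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}
# Two binary masks over the four patches.
masks = {"mask_a": [1, 1, 0, 0], "mask_b": [0, 0, 1, 1]}

labels = {
    name: classify_mask(mask_pool(patch_features, m), text_embeddings)[0]
    for name, m in masks.items()
}
print(labels)
```

Because the category list is just a dictionary of text embeddings, swapping in custom categories only changes `text_embeddings`; the dense patch features are computed once and reused for every mask, which is the efficiency argument the summary makes against per-crop feature extraction.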

Summary (original)

In this paper, we tackle a new computer vision task, open-vocabulary panoptic segmentation, that aims to perform panoptic segmentation (background semantic labeling + foreground instance segmentation) for arbitrary categories of text-based descriptions. We first build a baseline method without finetuning nor distillation to utilize the knowledge in the existing CLIP model. We then develop a new method, MaskCLIP, that is a Transformer-based approach using mask queries with the ViT-based CLIP backbone to perform semantic segmentation and object instance segmentation. Here we design a Relative Mask Attention (RMA) module to account for segmentations as additional tokens to the ViT CLIP model. MaskCLIP learns to efficiently and effectively utilize pre-trained dense/local CLIP features by avoiding the time-consuming operation to crop image patches and compute feature from an external CLIP image model. We obtain encouraging results for open-vocabulary panoptic segmentation and state-of-the-art results for open-vocabulary semantic segmentation on ADE20K and PASCAL datasets. We show qualitative illustration for MaskCLIP with custom categories.

arXiv information

Authors	Zheng Ding, Jieke Wang, Zhuowen Tu
Published	2022-08-18 17:55:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
