Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation

Summary

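This work focuses on instance-level open-vocabulary segmentation, aiming to extend a segmenter to novel categories at the instance level without mask annotations.
With the help of image captions, we investigate a simple yet effective framework that exploits the thousands of object nouns in captions to discover instances of novel classes.
Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation.
In particular, we devise a joint Caption Grounding and Generation (CGG) framework built on a Mask Transformer baseline.
The framework includes a novel grounding loss that performs explicit and implicit multi-modal feature alignment.
We further design a lightweight caption generation head to enable additional caption supervision.
We find that grounding and generation complement each other, significantly improving segmentation performance on novel categories.
We conduct extensive experiments on the COCO dataset under two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS).
The results demonstrate the superiority of the CGG framework over previous OVIS methods, with a large improvement of 6.8% mAP on novel classes without extra caption data.
Our method also improves PQ on novel classes by more than 15% on the OSPS benchmark under various settings.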

Summary (original)

In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.

arXiv information

Authors	Jianzong Wu, Xiangtai Li, Henghui Ding, Xia Li, Guangliang Cheng, Yunhai Tong, Chen Change Loy
Published	2023-01-02 18:52:12+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
