CapDet: Unifying Dense Captioning and Open-World Detection Pretraining

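Summary

Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown strong generalization in zero-shot and few-shot detection settings.
However, existing methods still require a predefined category space at inference time and predict only objects that belong to it.
To introduce a "real" open-world detector, the paper proposes CapDet, a method that can either predict under a given category list or directly generate the category of each predicted bounding box.
Specifically, it unifies open-world detection and dense captioning in a single, effective framework by adding a dense captioning head that generates region-grounded captions.
Adding the captioning task also helps detection generalize, because the captioning dataset covers more concepts.
Experiments show that, by unifying the dense captioning task, CapDet improves significantly over the baseline method on LVIS (1203 classes), for example by +2.1% mAP on LVIS rare classes.
CapDet also achieves state-of-the-art dense captioning performance, for example 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.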
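
The two inference modes described in the summary can be illustrated with a minimal sketch. This is not the authors' implementation: matching regions to category names by cosine similarity of embeddings is an assumption borrowed from common open-vocabulary detectors, and the names `label_region`, `category_embeddings`, and `caption_head` are hypothetical.

```python
import math
from typing import Callable, Mapping, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_region(
    region_embedding: Vector,
    category_embeddings: Optional[Mapping[str, Vector]],
    caption_head: Callable[[Vector], str],
) -> str:
    """Label one predicted box (hypothetical helper, not from the paper).

    Mode 1: a category list is given, so pick the best-matching name
            (assumed here to be an embedding-similarity lookup).
    Mode 2: no list is given, so fall back to a captioning head that
            generates free-form text for the region.
    """
    if category_embeddings:
        return max(
            category_embeddings,
            key=lambda name: cosine(region_embedding, category_embeddings[name]),
        )
    return caption_head(region_embedding)
```

With a category list, the call returns one of the listed names; without one, the label comes entirely from whatever `caption_head` generates, which is the behaviour that removes the need for a predefined category space.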

Abstract (original)

Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, a pre-defined category space is still required during the inference stage of existing methods and only the objects belonging to that space will be predicted. To introduce a ‘real’ open-world detector, in this paper, we propose a novel method named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head to generate the region-grounded captions. Besides, adding the captioning task will in turn benefit the generalization of detection performance since the captioning dataset covers more concepts. Experiment results show that by unifying the dense caption task, our CapDet has obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset.

arXiv information

Authors	Yanxin Long,Youpeng Wen,Jianhua Han,Hang Xu,Pengzhen Ren,Wei Zhang,Shen Zhao,Xiaodan Liang
Published	2023-03-15 13:45:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
