CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

Summary

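Deriving reliable region-word alignment from image-text pairs is essential for learning the object-level vision-language representations that open-vocabulary object detection requires.
Existing methods typically rely on pre-trained or self-trained vision-language models to perform the alignment, which tends to limit localization accuracy or generalization.
This paper proposes CoDet, a new approach that removes the dependence on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem.
Intuitively, when images whose captions mention a common concept are grouped together, the objects corresponding to that concept should co-occur frequently within the group.
CoDet then uses visual similarity to discover the co-occurring objects and align them with the shared concept.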
Extensive experiments show that CoDet delivers strong performance and compelling scalability in open-vocabulary detection. For example, with a scaled-up visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$.
Code is available at https://github.com/CVMI-Lab/CoDet.
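
The co-occurrence intuition can be illustrated with a small toy sketch. It is not the authors' implementation (the linked repository is the reference for that) and only assumes that each image in a group comes with a list of region feature vectors. The function names `cosine`, `cooccurrence_scores`, and `align_concept`, as well as the 3-dimensional features, are hypothetical and chosen for illustration. Each region of a query image is scored by its best cosine match in every other image of the group, and the region with the highest average score is taken as the one depicting the shared concept.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def cooccurrence_scores(query_regions, support_images):
    """For each query region, average its best match in every support image.

    query_regions: list of feature vectors for one image.
    support_images: list of images, each a list of region feature vectors.
    (Illustrative scoring rule, an assumption rather than the paper's exact one.)
    """
    scores = []
    for q in query_regions:
        best_per_image = [
            max(cosine(q, r) for r in regions) for regions in support_images
        ]
        scores.append(sum(best_per_image) / len(best_per_image))
    return scores


def align_concept(query_regions, support_images):
    """Return (index of the highest-scoring query region, all scores)."""
    scores = cooccurrence_scores(query_regions, support_images)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy group: three images whose captions all mention the same concept.
# The shared object has features near (1, 0, 0); the rest are distractors.
query = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
support = [
    [[1.0, 0.0, 0.1], [0.0, -1.0, 0.0]],
    [[0.0, 0.0, -1.0], [0.95, 0.05, 0.0]],
]
best, scores = align_concept(query, support)
print(best)  # -> 1, the region that recurs in every support image
```

In this toy setup the concept word would be paired with region 1 of the query image, without consulting any pre-aligned vision-language model; only region-to-region visual similarity within the group is used.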

Abstract (original)

Deriving reliable region-word alignment from image-text pairs is critical to learn object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept shall exhibit high co-occurrence among the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performances and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.

arXiv information

Authors	Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
Published	2023-10-25 14:31:02+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
