Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection

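Abstract

Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection.
By leveraging the generalization ability of vision-language models, current open-world detectors can recognize a broader vocabulary even though they are trained on a limited set of categories.
However, when the category vocabulary used during training grows to real-world scale, earlier classifiers aligned with coarse class names significantly degrade the recognition performance of these detectors.
This paper introduces Prova, a multi-modal prototype classifier for vast-vocabulary object detection.
To address recognition failures in the vast-vocabulary setting, Prova extracts comprehensive multi-modal prototypes and uses them to initialize alignment classifiers.
On V3Det, this simple method substantially improves one-stage, two-stage, and DETR-based detectors in both the supervised and open-vocabulary settings, adding only projection layers.
In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP, respectively, in the supervised setting of V3Det.
In the open-vocabulary setting, Prova sets a new state of the art with 32.8 base AP and 11.0 novel AP, gains of 2.6 and 4.3 over previous methods.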
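
The abstract does not spell out how the prototypes are built or how the classifier is wired in, so the following is only a minimal sketch of the general idea of a classifier initialized from multi-modal prototypes. Everything concrete here is an assumption made for illustration: the names `build_prototypes` and `PrototypeClassifier`, the choice to fuse text and image embeddings by averaging, the cosine-similarity logits, and the `temperature` value are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along `axis`."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def build_prototypes(text_emb, image_emb):
    """Fuse per-class text and image embeddings into one prototype per class.

    text_emb:  (C, D) one text embedding per class (hypothetical input)
    image_emb: (C, K, D) K exemplar image embeddings per class (hypothetical)
    Returns a (C, D) array of unit-norm prototypes.
    Assumption: simple averaging of the two modalities, chosen for illustration.
    """
    text_proto = l2_normalize(text_emb)
    image_proto = l2_normalize(l2_normalize(image_emb).mean(axis=1))
    return l2_normalize(text_proto + image_proto)


class PrototypeClassifier:
    """Cosine-similarity classifier whose weights start from the prototypes."""

    def __init__(self, prototypes, feat_dim, temperature=0.07, seed=0):
        rng = np.random.default_rng(seed)
        # Classifier weights are initialized from the prototypes (C, D).
        self.weight = prototypes.copy()
        # The only extra component: a projection from detector features
        # (feat_dim) into the prototype embedding space (D).
        embed_dim = prototypes.shape[1]
        self.proj = rng.normal(scale=feat_dim ** -0.5, size=(feat_dim, embed_dim))
        self.temperature = temperature  # assumed value, not from the paper

    def logits(self, region_feats):
        """region_feats: (N, feat_dim) -> (N, C) temperature-scaled cosine logits."""
        z = l2_normalize(region_feats @ self.proj)
        return z @ l2_normalize(self.weight).T / self.temperature


# Usage with random stand-in embeddings (no real model is involved).
rng = np.random.default_rng(0)
prototypes = build_prototypes(rng.normal(size=(10, 64)), rng.normal(size=(10, 4, 64)))
classifier = PrototypeClassifier(prototypes, feat_dim=256)
scores = classifier.logits(rng.normal(size=(8, 256)))  # shape (8, 10)
```

In a real detector the projection and the classifier weights would be trained jointly with the rest of the network; the sketch only shows how prototype initialization and a projection layer could fit together.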

Abstract (original)

Enabling models to recognize vast open-world categories has been a longstanding pursuit in object detection. By leveraging the generalization capabilities of vision-language models, current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. However, when the scale of the category vocabularies during training expands to a real-world level, previous classifiers aligned with coarse class names significantly reduce the recognition performance of these detectors. In this paper, we introduce Prova, a multi-modal prototype classifier for vast-vocabulary object detection. Prova extracts comprehensive multi-modal prototypes as initialization of alignment classifiers to tackle the vast-vocabulary object recognition failure problem. On V3Det, this simple method greatly enhances the performance among one-stage, two-stage, and DETR-based detectors with only additional projection layers in both supervised and open-vocabulary settings. In particular, Prova improves Faster R-CNN, FCOS, and DINO by 3.3, 6.2, and 2.9 AP respectively in the supervised setting of V3Det. For the open-vocabulary setting, Prova achieves a new state-of-the-art performance with 32.8 base AP and 11.0 novel AP, which is of 2.6 and 4.3 gain over the previous methods.

arXiv information

Authors	Yitong Chen, Wenhao Yao, Lingchen Meng, Sihong Wu, Zuxuan Wu, Yu-Gang Jiang
Published	2024-12-23 18:57:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
