Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection

Summary

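Thanks to large-scale image-text contrastive training, pre-trained vision-language models (VLMs) such as CLIP show strong open-vocabulary recognition ability.
Most existing open-vocabulary object detectors try to use a pre-trained VLM to obtain generative representations.
F-ViT uses the pre-trained visual encoder as its backbone network and freezes it during training.
However, a frozen backbone cannot use labeled data to strengthen its representations.
The authors therefore propose a new two-branch backbone design, the ViT-Feature-Modulated Multi-Scale Convolutional Network (VMCNet).
VMCNet consists of a trainable convolutional branch, a frozen pre-trained ViT branch, and a feature modulation module.
The trainable CNN branch can be optimized with labeled data, while the frozen pre-trained ViT branch retains the representation ability gained from large-scale pre-training.
The feature modulation module then modulates the multi-scale CNN features with the representations from the ViT branch.
With this mixed structure, the detector is more likely to discover novel categories.
Evaluated on two popular benchmarks, the method improves detection performance on novel categories and outperforms the baseline.
On OV-COCO, it achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14.
On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$, respectively.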
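
The abstract names a feature modulation module but does not say which operator it uses. The sketch below is therefore only an illustration of the general idea, not the authors' implementation: it assumes a FiLM-style per-channel scale and shift, where a fixed vector standing in for the frozen ViT representation predicts a scale and a shift for every channel of each CNN feature map. The names `linear`, `modulate`, `modulate_pyramid`, and the weight layout are all hypothetical.

```python
# Illustrative sketch only. The modulation operator is NOT given in the abstract;
# a FiLM-style per-channel scale and shift is ASSUMED here.
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
FeatureMap = List[List[float]]  # [channels][flattened spatial positions]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def modulate(cnn_feat: FeatureMap, vit_vec: Vector,
             w_scale: Matrix, w_shift: Matrix) -> FeatureMap:
    """Scale and shift each CNN channel with values predicted from the ViT vector.

    vit_vec is treated as a constant input (the ViT branch is frozen);
    w_scale and w_shift stand in for the trainable parameters (hypothetical).
    """
    scale = linear(w_scale, vit_vec)  # one scale per channel
    shift = linear(w_shift, vit_vec)  # one shift per channel
    return [[(1.0 + s) * v + b for v in channel]
            for channel, s, b in zip(cnn_feat, scale, shift)]


def modulate_pyramid(pyramid: List[FeatureMap], vit_vec: Vector,
                     params: List[Tuple[Matrix, Matrix]]) -> List[FeatureMap]:
    """Apply the same ViT vector to every scale, with separate weights per scale."""
    return [modulate(feat, vit_vec, w_scale, w_shift)
            for feat, (w_scale, w_shift) in zip(pyramid, params)]


# Toy example: a 2-dimensional "ViT" vector and a two-level feature pyramid.
vit_vec = [1.0, 2.0]
pyramid = [
    [[1.0, 2.0], [3.0, 4.0]],  # scale 0: 2 channels, 2 positions
    [[5.0]],                   # scale 1: 1 channel, 1 position
]
params = [
    ([[0.5, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]),
    ([[0.0, 0.5]], [[1.0, 0.0]]),
]
modulated = modulate_pyramid(pyramid, vit_vec, params)
```

With all-zero weights the CNN features pass through unchanged, so under this assumed formulation the ViT branch only adjusts the CNN features rather than replacing them.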

Abstract (original)

Owing to large-scale image-text contrastive training, pre-trained vision language model (VLM) like CLIP shows superior open-vocabulary recognition ability. Most existing open-vocabulary object detectors attempt to utilize the pre-trained VLM to attain generative representation. F-ViT uses the pre-trained visual encoder as the backbone network and freezes it during training. However, the frozen backbone doesn’t benefit from the labeled data to strengthen the representation. Therefore, we propose a novel two-branch backbone network design, named as ViT-Feature-Modulated Multi-Scale Convolutional network (VMCNet). VMCNet consists of a trainable convolutional branch, a frozen pre-trained ViT branch and a feature modulation module. The trainable CNN branch could be optimized with labeled data while the frozen pre-trained ViT branch could keep the representation ability derived from large-scale pre-training. Then, the proposed feature modulation module could modulate the multi-scale CNN features with the representations from ViT branch. With the proposed mixed structure, detector is more likely to discover novel categories. Evaluated on two popular benchmarks, our method boosts the detection performance on novel category and outperforms the baseline. On OV-COCO, the proposed method achieves 44.3 AP$_{50}^{\mathrm{novel}}$ with ViT-B/16 and 48.5 AP$_{50}^{\mathrm{novel}}$ with ViT-L/14. On OV-LVIS, VMCNet with ViT-B/16 and ViT-L/14 reaches 27.8 and 38.4 mAP$_{r}$.

arXiv information

Authors	Xiangyu Gao, Yu Dai, Benliu Qiu, Hongliang Li
Published	2025-01-28 14:28:55+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
