Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Abstract

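Modern object detectors benefit from backbone networks pre-trained on large-scale datasets. Apart from the backbone, however, components such as the detector head and the feature pyramid network (FPN) are still trained from scratch, which prevents the detector from fully exploiting the potential of the representation model. This study proposes to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, building a "fully pre-trained" feature extraction path so that the detector's generalization capacity is maximized. imTED differs from the baseline detector in two main ways: (1) the pre-trained transformer decoder is migrated to the detector head, and the randomly initialized FPN is removed from the feature extraction path; and (2) a multi-scale feature modulator (MFM) is defined to strengthen scale adaptability. This design not only greatly reduces the number of randomly initialized parameters but also deliberately unifies detector training with representation learning. In experiments on the MS COCO object detection dataset, imTED consistently outperforms its counterparts by about 2.4 AP. In few-shot object detection, imTED improves on the state of the art by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.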
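To make the claim about randomly initialized parameters concrete, the following is a minimal Python sketch that compares the share of randomly initialized parameters in a baseline-style detector and an imTED-style detector. Every parameter count is a hypothetical placeholder chosen for illustration, not a figure from the paper or its repository. The component names, the `random_init_fraction` helper, and the assumption that the modulator and the final prediction layers are the only newly initialized parts are likewise assumptions.

```python
# Toy accounting only: all parameter counts below are hypothetical
# placeholders, not figures from the paper or its repository.


def random_init_fraction(components):
    """Return the share of parameters that start from random initialization.

    `components` maps a component name to (num_params, is_pretrained).
    """
    total = sum(n for n, _ in components.values())
    random_init = sum(n for n, pretrained in components.values() if not pretrained)
    return random_init / total


# Baseline-style detector: only the backbone is pre-trained;
# the FPN and the detector head are trained from scratch.
baseline = {
    "backbone": (86_000_000, True),
    "fpn": (3_000_000, False),
    "detector_head": (14_000_000, False),
}

# imTED-style detector: the pre-trained encoder and decoder are both migrated
# and the FPN is removed. Treating the modulator and prediction layers as the
# only randomly initialized parts is an assumption made for this sketch.
imted_style = {
    "encoder": (86_000_000, True),
    "decoder_as_head": (10_000_000, True),
    "multi_scale_feature_modulator": (1_000_000, False),
    "prediction_layers": (1_000_000, False),
}

print(f"baseline:    {random_init_fraction(baseline):.1%} randomly initialized")
print(f"imTED-style: {random_init_fraction(imted_style):.1%} randomly initialized")
```

With these placeholder counts the imTED-style configuration has a much smaller randomly initialized share than the baseline. The actual ratio depends on the real model sizes, which this sketch does not attempt to reproduce.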

Abstract (original)

Modern object detectors have taken the advantages of backbone networks pre-trained on large scale datasets. Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders fully tapping the potential of representation models. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is “fully pre-trained” so that detectors’ generalization capacity is maximized. The essential differences between imTED with the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only reduce randomly initialized parameters significantly but also unify detector training with representation learning intendedly. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by $\sim$2.4 AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6 AP. Code is available at https://github.com/LiewFeng/imTED.

arXiv information

Authors	Feng Liu, Xiaosong Zhang, Zhiliang Peng, Zonghao Guo, Fang Wan, Xiangyang Ji, Qixiang Ye
Published	2022-12-02 14:57:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
