Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection

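Abstract

This paper presents a new idea: generating object proposals that exploit temporal information for video object detection. Feature aggregation in recent region-based video object detectors relies heavily on learned proposals generated by a single-frame RPN. This directly introduces additional components such as NMS and yields unreliable proposals on low-quality frames. To address these limitations, we present SparseVOD, a novel video object detection pipeline that uses Sparse R-CNN to exploit temporal information. Specifically, we introduce two modules into the dynamic head of Sparse R-CNN. First, a Temporal Feature Extraction module based on the Temporal RoI Align operation is added to extract RoI proposal features. Second, motivated by sequence-level semantic aggregation, we incorporate an attention-guided Semantic Proposal Feature Aggregation module to enhance object feature representations before detection. The proposed SparseVOD effectively reduces the overhead of complicated post-processing methods and makes the whole pipeline end-to-end trainable. Extensive experiments show that our method improves on the single-frame Sparse R-CNN by 8% to 9% in mAP. Furthermore, in addition to achieving a state-of-the-art 80.3% mAP on the ImageNet VID dataset with a ResNet-50 backbone, SparseVOD outperforms existing proposal-based methods by a significant margin at increasing IoU thresholds (IoU > 0.5).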

Abstract (original)

This paper presents the novel idea of generating object proposals by leveraging temporal information for video object detection. The feature aggregation in modern region-based video object detectors heavily relies on learned proposals generated from a single-frame RPN. This imminently introduces additional components like NMS and produces unreliable proposals on low-quality frames. To tackle these restrictions, we present SparseVOD, a novel video object detection pipeline that employs Sparse R-CNN to exploit temporal information. In particular, we introduce two modules in the dynamic head of Sparse R-CNN. First, the Temporal Feature Extraction module based on the Temporal RoI Align operation is added to extract the RoI proposal features. Second, motivated by sequence-level semantic aggregation, we incorporate the attention-guided Semantic Proposal Feature Aggregation module to enhance object feature representation before detection. The proposed SparseVOD effectively alleviates the overhead of complicated post-processing methods and makes the overall pipeline end-to-end trainable. Extensive experiments show that our method significantly improves the single-frame Sparse RCNN by 8%-9% in mAP. Furthermore, besides achieving state-of-the-art 80.3% mAP on the ImageNet VID dataset with ResNet-50 backbone, our SparseVOD outperforms existing proposal-based methods by a significant margin on increasing IoU thresholds (IoU > 0.5).
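
As a rough illustration of the attention-guided aggregation idea mentioned in the abstract, the Python sketch below blends a target-frame proposal feature with proposal features from support frames, weighting each by its similarity to the target. This is not the authors' implementation: the helper names `cosine`, `softmax`, and `aggregate`, the `temperature` parameter, and the choice of cosine similarity followed by a softmax are all assumptions made for illustration, and the toy two-dimensional features stand in for real RoI features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(target, supports, temperature=1.0):
    """Hypothetical helper: similarity-weighted blend of a target proposal
    feature with support-frame proposal features (assumed scheme)."""
    candidates = [target] + supports  # the target also attends to itself
    weights = softmax([cosine(target, c) / temperature for c in candidates])
    dim = len(target)
    return [sum(w * c[i] for w, c in zip(weights, candidates)) for i in range(dim)]

# Toy features: one support proposal resembles the target, one does not.
target = [1.0, 0.0]
supports = [[0.9, 0.1], [0.0, 1.0]]
enhanced = aggregate(target, supports)
```

In this toy setup the dissimilar support feature receives the smallest weight, so the aggregated feature stays close to the target while still drawing on the similar support proposal. A lower `temperature` would sharpen that preference.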

arXiv information

Authors	Khurram Azeem Hashmi, Didier Stricker, Muhammad Zeshan Afzal
Published	2022-10-05 16:17:55+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
