MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Abstract

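State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, and use attention mechanisms to enhance the current frame.
However, we argue that these memory structures are neither efficient nor sufficient because of two implicit operations: (1) all features in memory are concatenated for enhancement, which incurs a heavy computational cost;
(2) the memory is updated frame by frame, which prevents it from capturing more temporal information.
In this paper, we propose MAMBA, a multi-level aggregation architecture built on a memory bank.
Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction, which significantly reduces the computational cost;
(2) a fine-grained, feature-wise updating strategy, which lets the method use knowledge from the whole video.
To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) that aggregates multi-level features in a unified manner.
We conduct extensive evaluations on the challenging ImageNetVID dataset.
Compared with existing state-of-the-art methods, our method achieves superior performance in both speed and accuracy.
More remarkably, MAMBA achieves 83.7/84.6% mAP at 12.6/9.1 FPS with ResNet-101.
Code is available at https://github.com/guanxiongsun/vfe.pytorch.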
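
The abstract names the two memory bank operations but does not say how they are implemented. The toy sketch below only illustrates the general idea and is not the authors' code. Everything in it is an assumption made for illustration: the names `MemoryBank`, `key_set`, `update`, and `enhance`, the uniform random sampling used to build the key set, the random-slot replacement used for the feature-wise update, and the plain dot-product attention with a residual connection. In the spirit of GEO, the same `enhance` function could be applied to pixel-level and proposal-level feature vectors alike, though how GEO actually unifies the two levels is not specified here.

```python
import math
import random


class MemoryBank:
    """Toy feature-level memory bank (illustrative assumption, not the paper's code)."""

    def __init__(self, capacity, key_set_size, seed=0):
        self.capacity = capacity          # maximum number of stored feature vectors
        self.key_set_size = key_set_size  # features attended to per enhancement
        self.features = []                # stored feature vectors (lists of floats)
        self.rng = random.Random(seed)

    def key_set(self):
        # Light-weight key set: use a small subset of the bank instead of
        # concatenating every stored feature (assumed: uniform sampling).
        k = min(self.key_set_size, len(self.features))
        return self.rng.sample(self.features, k)

    def update(self, new_features):
        # Feature-wise update: each incoming feature is stored or replaces one
        # random slot on its own, so features from much older frames can
        # survive instead of whole frames being evicted together.
        for f in new_features:
            if len(self.features) < self.capacity:
                self.features.append(list(f))
            else:
                self.features[self.rng.randrange(self.capacity)] = list(f)


def enhance(query, keys):
    """Return query plus the softmax-attention-weighted sum of the keys."""
    if not keys:
        return list(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = list(query)
    for w, key in zip(weights, keys):
        for i, v in enumerate(key):
            out[i] += (w / total) * v
    return out


# Usage: enhance each frame's features with the key set, then update the bank.
bank = MemoryBank(capacity=8, key_set_size=4)
for frame in range(5):
    feats = [[float(frame), 1.0], [1.0, float(frame)], [0.5, 0.5]]
    enhanced = [enhance(f, bank.key_set()) for f in feats]
    bank.update(feats)
```

In this sketch the attention cost per query depends on `key_set_size` rather than on the total number of stored features, which is the kind of saving the abstract attributes to key-set construction.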

Abstract (original)

State-of-the-art video object detection methods maintain a memory structure, either a sliding window or a memory queue, to enhance the current frame using attention mechanisms. However, we argue that these memory structures are not efficient or sufficient because of two implied operations: (1) concatenating all features in memory for enhancement, leading to a heavy computational cost; (2) frame-wise memory updating, preventing the memory from capturing more temporal information. In this paper, we propose a multi-level aggregation architecture via memory bank called MAMBA. Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods: (1) light-weight key-set construction which can significantly reduce the computational cost; (2) fine-grained feature-wise updating strategy which enables our method to utilize knowledge from the whole video. To better enhance features from complementary levels, i.e., feature maps and proposals, we further propose a generalized enhancement operation (GEO) to aggregate multi-level features in a unified manner. We conduct extensive evaluations on the challenging ImageNetVID dataset. Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy. More remarkably, MAMBA achieves mAP of 83.7/84.6% at 12.6/9.1 FPS with ResNet-101. Code is available at https://github.com/guanxiongsun/vfe.pytorch.

arXiv information

Authors	Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
Published	2024-02-01 18:43:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
