MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training

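Summary

We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance using neither video-based architectures nor video-based training procedures. By training only a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by more than 10% AP. Because MinVIS treats the frames of training videos as independent images, the annotated frames in training videos can be drastically sub-sampled without any modifications. With only 1% of the frames labeled, MinVIS outperforms or is comparable to fully supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to discriminate between object instances within a frame are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS therefore has the following inference pipeline. First, the trained query-based image instance segmentation model is applied to each video frame independently. Next, the segmented instances are tracked by bipartite matching of the corresponding queries. Because inference runs online, the whole video does not need to be processed at once. MinVIS thus offers the practical advantages of reducing both labeling cost and memory requirements without sacrificing VIS performance. Code is available at https://github.com/NVlabs/MinVIS.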

Summary (original)

We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS
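The tracking step in the inference pipeline described above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`cosine_similarity`, `match_queries`, `track_video`) are hypothetical, cosine similarity is an assumed matching score (the abstract does not specify one), and the segmentation model is left out, so each frame is represented only by a list of query embeddings. The sketch also assumes every frame has the same number of queries, and it finds the optimal bipartite matching by brute force over permutations, which is only practical for a handful of queries; a Hungarian solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice at realistic sizes.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed matching score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_queries(prev_queries, curr_queries):
    """Optimal one-to-one matching between two equally sized query sets.

    Returns a list `perm` where perm[i] is the index in `curr_queries`
    matched to prev_queries[i]. Brute force over permutations: fine for a
    toy example, exponential in general.
    """
    n = len(prev_queries)
    if len(curr_queries) != n:
        raise ValueError("this sketch assumes the same number of queries per frame")
    sim = [[cosine_similarity(p, c) for c in curr_queries] for p in prev_queries]
    best = max(
        itertools.permutations(range(n)),
        key=lambda perm: sum(sim[i][perm[i]] for i in range(n)),
    )
    return list(best)


def track_video(per_frame_queries):
    """Online tracking: frames are consumed one at a time.

    Only the previous frame's (reordered) queries are kept in memory.
    Returns one list per frame; entry k is the index of the query in that
    frame assigned to track k.
    """
    assignments = []
    prev = None
    for queries in per_frame_queries:
        if prev is None:
            order = list(range(len(queries)))  # first frame defines the tracks
        else:
            order = match_queries(prev, queries)
        assignments.append(order)
        prev = [queries[j] for j in order]  # slot k keeps following track k
    return assignments


# Toy input: two instances whose query order is swapped in the second frame.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
    [[0.8, 0.2], [0.2, 0.8]],
]
print(track_video(frames))  # [[0, 1], [1, 0], [0, 1]]
```

In the toy input, the second frame lists the two instances in swapped order, and the matching recovers the swap so that each track keeps following the same instance. Because `track_video` keeps only the previous frame's queries, its memory use does not grow with video length, which mirrors the online property stated in the abstract.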

arXiv information

Authors	De-An Huang, Zhiding Yu, Anima Anandkumar
Published	2022-08-03 17:50:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
