Time-Space Transformers for Video Panoptic Segmentation

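Summary

We propose a new solution to the task of video panoptic segmentation, which simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks.
Our network, named VPS-Transformer, has a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab: it combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block.
The Transformer, equipped with attention mechanisms, models spatio-temporal relations between the backbone output features of the current and past frames to produce more accurate and consistent panoptic estimates.
Because the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation.
We study how to aggregate information more effectively over the space-time volume, and we compare several variants of the Transformer block with different attention schemes.
Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.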
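
The idea of attending from the current frame to features of the current and past frames can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head, no positional encodings, and random arrays standing in for backbone output features. The function name `space_time_attention` and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical names introduced only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def space_time_attention(current, past, w_q, w_k, w_v):
    """Single-head attention over a space-time volume (illustrative only).

    current: (H, W, C) features of the current frame
    past:    (T, H, W, C) features of T past frames
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters)
    """
    h, w, c = current.shape
    # Queries come from the current frame only: one token per spatial location.
    queries = current.reshape(h * w, c) @ w_q
    # Keys and values come from the current and past frames, flattened
    # over space and time into one sequence of (T + 1) * H * W tokens.
    memory = np.concatenate([current[None], past], axis=0).reshape(-1, c)
    keys = memory @ w_k
    values = memory @ w_v
    # Scaled dot-product attention.
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)
    refined = weights @ values
    return refined.reshape(h, w, -1), weights

# Toy sizes; real backbone feature maps are much larger.
rng = np.random.default_rng(0)
H, W, C, D, T = 4, 8, 16, 16, 2
current = rng.standard_normal((H, W, C))
past = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))

refined, attn = space_time_attention(current, past, w_q, w_k, w_v)
print(refined.shape, attn.shape)  # (4, 8, 16) (32, 96)
```

Each of the 32 current-frame locations receives a weighted mix of all 96 space-time tokens, which is one simple way to read "modeling spatio-temporal relations" between frames. The attention variants compared in the paper are not reproduced here.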

Summary (original)

We propose a novel solution for the task of video panoptic segmentation, that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, with a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab, combines a convolutional architecture for single-frame panoptic segmentation and a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between backbone output features of current and past frames for more accurate and consistent panoptic estimates. As the pure Transformer block introduces large computation overhead when processing high resolution images, we propose a few design changes for a more efficient compute. We study how to aggregate information more effectively over the space-time volume and we compare several variants of the Transformer block with different attention schemes. Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves the temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.
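
The computation overhead mentioned in the abstract can be made concrete with a rough count of attention entries. The sketch below is a back-of-the-envelope estimate, not a figure from the paper: the feature stride (16 or 32), the number of frames (2), and the channel width (256) are assumptions chosen for illustration, and `attention_cost` is a hypothetical helper. Only the 1024x2048 Cityscapes image size is a known property of the dataset.

```python
def attention_cost(height, width, stride, num_frames, channels):
    """Rough cost of full space-time attention (illustrative estimate).

    Queries are taken from one frame; keys and values span `num_frames` frames.
    Returns (query tokens, key tokens, attention entries, multiply-adds).
    """
    h, w = height // stride, width // stride
    n_query = h * w
    n_key = n_query * num_frames
    entries = n_query * n_key
    # Computing the score matrix and applying it to the values each take
    # roughly entries * channels multiply-adds.
    macs = 2 * entries * channels
    return n_query, n_key, entries, macs

# Assumed settings: 1024x2048 input, 2 frames, 256 channels.
for stride in (16, 32):
    n_query, n_key, entries, macs = attention_cost(1024, 2048, stride, 2, 256)
    print(f"stride {stride}: {n_query} queries x {n_key} keys = "
          f"{entries:,} entries, about {macs / 1e9:.1f} G multiply-adds")
```

Under these assumptions, halving the feature resolution along each side reduces the attention cost by a factor of 16, which suggests why design changes that shrink the attended token set matter at high resolution.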

arXiv information

Authors	Andra Petrovai, Sergiu Nedevschi
Published	2022-10-07 13:30:11+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
