SpVOS: Efficient Video Object Segmentation with Triple Sparse Convolution

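Summary

Semi-supervised video object segmentation (Semi-VOS), which segments the remaining frames of a video given an annotation of only the first frame, has attracted growing attention.
Among existing pipelines, memory-matching-based methods are becoming the research mainstream because they exploit temporal information to produce high-quality segmentation.
Although these methods achieve promising accuracy, the overall framework still carries heavy computational overhead, mainly from per-frame dense convolutions between high-resolution feature maps and every kernel filter.
This work therefore proposes SpVOS, a sparse baseline for VOS built on a novel triple sparse convolution that reduces the computation cost of the whole VOS framework.
The triple gate accounts for both spatial and temporal redundancy between adjacent video frames and adaptively makes a three-way decision on how to apply sparse convolution at each pixel. This controls the computation overhead of each layer while retaining enough discriminative capability to distinguish similar objects and avoid error accumulation.
A mixed sparse training strategy, paired with an objective that incorporates a sparsity constraint, balances segmentation performance against computation cost.
Experiments are conducted on two mainstream VOS datasets, DAVIS and Youtube-VOS.
SpVOS outperforms other state-of-the-art sparse methods and stays comparable to the typical non-sparse VOS baseline: it scores 83.04% overall on the DAVIS-2017 validation set and 79.29% on the Youtube-VOS validation set, against 82.88% and 80.36% for the baseline, while saving up to 42% of FLOPs, which indicates potential for resource-constrained scenarios.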
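
To make the idea of a per-pixel three-way decision concrete, here is a minimal sketch in plain Python. It is not the SpVOS implementation: the abstract does not say what the three choices are, so the actions `SKIP`, `REUSE`, and `COMPUTE`, the threshold-based `triple_gate` (a stand-in for a learned gate), and all function and parameter names are assumptions made for illustration only.

```python
# Hypothetical three-way actions; the abstract does not name them.
SKIP, REUSE, COMPUTE = 0, 1, 2


def conv3x3_at(feat, kernel, y, x):
    """Dense 3x3 convolution response at a single pixel (zero padding)."""
    h, w = len(feat), len(feat[0])
    total = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                total += feat[yy][xx] * kernel[dy + 1][dx + 1]
    return total


def triple_gate(curr, prev, y, x, fg_thresh, change_thresh):
    """Toy gate using fixed thresholds (assumption; a real gate would be learned)."""
    if abs(curr[y][x]) < fg_thresh:
        return SKIP  # spatial redundancy: nothing of interest at this pixel
    if abs(curr[y][x] - prev[y][x]) < change_thresh:
        return REUSE  # temporal redundancy: pixel unchanged since the last frame
    return COMPUTE


def sparse_conv(curr, prev, prev_out, kernel, fg_thresh=0.1, change_thresh=0.05):
    """Apply the convolution only where the gate asks for it.

    Returns the output map and the fraction of per-pixel convolutions avoided.
    """
    h, w = len(curr), len(curr[0])
    out = [[0.0] * w for _ in range(h)]
    computed = 0
    for y in range(h):
        for x in range(w):
            decision = triple_gate(curr, prev, y, x, fg_thresh, change_thresh)
            if decision == REUSE:
                out[y][x] = prev_out[y][x]  # copy the cached response
            elif decision == COMPUTE:
                out[y][x] = conv3x3_at(curr, kernel, y, x)
                computed += 1
            # SKIP leaves the output at zero
    return out, 1.0 - computed / (h * w)


# Two tiny 4x4 "frames": only the pixel at (2, 2) changes between them.
prev = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
curr = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 2, 0], [0, 0, 0, 0]]
kernel = [[1.0] * 3 for _ in range(3)]
# Cached dense output of the previous frame.
prev_out = [[conv3x3_at(prev, kernel, y, x) for x in range(4)] for y in range(4)]

out, saved = sparse_conv(curr, prev, prev_out, kernel)
```

In this toy case only one of the 16 pixels is recomputed. Note that reused responses are approximations: a reused pixel does not see changes in its neighbours, which is the kind of accuracy-versus-cost trade-off the summary says the gate and training strategy are meant to balance.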

Abstract (original)

Semi-supervised video object segmentation (Semi-VOS), which requires only annotating the first frame of a video to segment future frames, has received increased attention recently. Among existing pipelines, the memory-matching-based one is becoming the main research stream, as it can fully utilize the temporal sequence information to obtain high-quality segmentation results. Even though this type of method has achieved promising performance, the overall framework still suffers from heavy computation overhead, mainly caused by the per-frame dense convolution operations between high-resolution feature maps and each kernel filter. Therefore, we propose a sparse baseline of VOS named SpVOS in this work, which develops a novel triple sparse convolution to reduce the computation costs of the overall VOS framework. The designed triple gate, taking full consideration of both spatial and temporal redundancy between adjacent video frames, adaptively makes a triple decision to decide how to apply the sparse convolution on each pixel to control the computation overhead of each layer, while maintaining sufficient discrimination capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with a designed objective considering the sparsity constraint, is also developed to balance the VOS segmentation performance and computation costs. Experiments are conducted on two mainstream VOS datasets, including DAVIS and Youtube-VOS. Results show that, the proposed SpVOS achieves superior performance over other state-of-the-art sparse methods, and even maintains comparable performance, e.g., an 83.04% (79.29%) overall score on the DAVIS-2017 (Youtube-VOS) validation set, with the typical non-sparse VOS baseline (82.88% for DAVIS-2017 and 80.36% for Youtube-VOS) while saving up to 42% FLOPs, showing its application potential for resource-constrained scenarios.
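
The abstract mentions an objective that considers a sparsity constraint but does not give its form. One generic way such an objective is often written, shown here purely as an illustration and not taken from the paper, adds a penalty on the gap between the achieved compute ratio and a target. The symbols $\lambda$ (penalty weight) and $\tau$ (target compute ratio) are hypothetical.

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{seg}}
  + \lambda \left(
      \frac{\mathrm{FLOPs}_{\mathrm{sparse}}}{\mathrm{FLOPs}_{\mathrm{dense}}} - \tau
    \right)^{2}
```

Under this assumed form, a target of $\tau = 0.58$ would correspond to the 42% FLOPs saving quoted above.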

arXiv information

Authors	Weihao Lin, Tao Chen, Chong Yu
Published	2023-10-23 17:21:33+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
