Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

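Abstract

Prevailing Video Object Segmentation (VOS) methods usually extract features first and then perform dense matching between the current frame and the reference frames.
On the one hand, this decoupled modeling restricts the propagation of target information to the high-level feature space.
On the other hand, pixel-wise matching lacks a holistic understanding of the targets.
To overcome these issues, we propose a unified VOS framework, coined JointFormer, that jointly models three elements: feature, correspondence, and a compressed memory.
The core design is the Joint Block, which uses the flexibility of attention to extract features and, at the same time, propagate target information to the current tokens and the compressed memory token.
This scheme enables extensive information propagation and discriminative feature learning.
To incorporate long-term temporal target information, we also devise a customized online updating mechanism for the compressed memory token, which promotes information flow along the temporal dimension and thus improves the global modeling capability.
With this design, our method achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.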

Abstract (original)

Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting their features. On one hand, the decoupled modeling restricts the propagation of target information to the high-level feature space. On the other hand, the pixel-wise matching leads to a lack of holistic understanding of the targets. To overcome these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory. The core design is the Joint Block, which utilizes the flexibility of attention to simultaneously extract features and propagate the target information to the current tokens and the compressed memory token. This scheme allows extensive information propagation and discriminative feature learning. To incorporate long-term temporal target information, we also devise a customized online updating mechanism for the compressed memory token, which can promote the information flow along the temporal dimension and thus improve the global modeling capability. Under this design, our method achieves new state-of-the-art performance on the DAVIS 2017 val/test-dev (89.7% and 87.6%) and YouTube-VOS 2018/2019 val (87.0% and 87.0%) benchmarks, outperforming existing works by a large margin.
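
To make the idea of one attention call that both mixes features and propagates target information more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the attention layout, so the choice of queries (current tokens plus one memory token), keys and values (all tokens), the single head, the residual connection, and every name (`joint_block`, `w_q`, `w_k`, `w_v`, `random_matrix`) are assumptions made for illustration. Carrying the returned memory token into the next frame only stands in for the online updating mechanism, whose actual design is not described here.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_block(reference, current, memory, w_q, w_k, w_v):
    """One illustrative joint attention step (hypothetical layout).

    reference: tokens from frames with known masks, shape (n_ref, d)
    current:   tokens of the frame being segmented, shape (n_cur, d)
    memory:    a single compressed memory token, shape (1, d)
    w_q, w_k, w_v: square (d, d) projection matrices
    """
    # Assumption: current and memory tokens act as queries, while keys and
    # values come from all tokens, so target information held by the
    # reference tokens can reach both in the same attention call.
    queries_in = current + memory
    keys_in = reference + current + memory
    dim = len(w_q[0])

    q = matmul(queries_in, w_q)
    k = matmul(keys_in, w_k)
    v = matmul(keys_in, w_v)

    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    attended = matmul(weights, v)

    # Residual connection: each query token keeps its own features and adds
    # what it gathered from the other tokens.
    updated = [
        [x + y for x, y in zip(tok, att)]
        for tok, att in zip(queries_in, attended)
    ]
    return updated[: len(current)], updated[len(current):]


def random_matrix(rows, cols, rng):
    """Hypothetical helper that builds toy tokens and weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d = 4
w_q, w_k, w_v = (random_matrix(d, d, rng) for _ in range(3))
reference = random_matrix(3, d, rng)
memory = random_matrix(1, d, rng)

# Process two frames, carrying the memory token forward between them.
for _ in range(2):
    current = random_matrix(5, d, rng)
    current, memory = joint_block(reference, current, memory, w_q, w_k, w_v)
```

In this toy layout the memory token is just one more query, so it is refreshed by the same attention call that updates the current-frame tokens. A real model would stack many such blocks with multiple heads, normalization, and learned weights.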

arXiv information

Authors	Jiaming Zhang, Yutao Cui, Gangshan Wu, Limin Wang
Published	2023-08-25 17:30:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
