Learning Spatial-Semantic Features for Robust Video Object Segmentation

Summary

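Tracking and segmenting multiple similar objects with complex or separate parts in long videos is inherently difficult, because target parts are ambiguous and because occlusion, background clutter, and long-term variations cause identity confusion.
In this paper, we propose a robust video object segmentation framework that combines spatial-semantic features with discriminative object queries to address these problems.
Specifically, we build a spatial-semantic network, consisting of a semantic embedding block and a spatial dependencies modeling block, that associates pretrained ViT features with global semantic features and local spatial features to provide a comprehensive target representation.
We also develop a masked cross-attention module that generates object queries focused on the most discriminative parts of the target objects during query propagation, which alleviates noise accumulation and keeps long-term query propagation effective.
Experimental results show that the proposed method sets a new state of the art on multiple datasets, including the DAVIS2017 test (89.1%), YoutubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), demonstrating its effectiveness and generalization capacity.
We will make all source code and trained models publicly available.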
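
To make the masked cross-attention idea in the summary concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract gives no architectural details, so the function name `masked_cross_attention`, the boolean mask layout, and the fallback for an empty mask are all assumptions made for illustration. It shows only the generic mechanism, in which each object query computes scaled dot-product attention over feature positions but is restricted to the positions its mask allows.

```python
import math

def masked_cross_attention(queries, keys, values, mask):
    """Generic masked cross-attention (illustrative sketch, not the paper's module).

    queries: list of Nq vectors of length d
    keys:    list of Nk vectors of length d
    values:  list of Nk vectors of length dv
    mask:    Nq x Nk booleans, True where a query may attend
    """
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for q, allowed in zip(queries, mask):
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        # Assumption: if the mask is empty, fall back to unmasked attention
        # so the softmax below stays well defined.
        if any(allowed):
            scores = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
        # Numerically stable softmax; masked positions get weight 0.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dv)])
    return outputs

# Hypothetical toy data: one query, three feature positions, the third masked out.
toy_queries = [[1.0, 0.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
toy_values = [[1.0], [2.0], [100.0]]
toy_mask = [[True, True, False]]
toy_output = masked_cross_attention(toy_queries, toy_keys, toy_values, toy_mask)
```

In this toy call the third position has the largest raw score, but the mask excludes it, so the output is a blend of the first two values only. That is the sense in which masking keeps a query focused on selected target regions instead of on distractors.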

Abstract (original)

Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations. In this paper, we propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries to address the above issues. Specifically, we construct a spatial-semantic network comprising a semantic embedding block and spatial dependencies modeling block to associate the pretrained ViT features with global semantic features and local spatial features, providing a comprehensive target representation. In addition, we develop a masked cross-attention module to generate object queries that focus on the most discriminative parts of target objects during query propagation, alleviating noise accumulation and ensuring effective long-term query propagation. The experimental results show that the proposed method set a new state-of-the-art performance on multiple datasets, including the DAVIS2017 test (89.1%), YoutubeVOS 2019 (88.5%), MOSE (75.1%), LVOS test (73.0%), and LVOS val (75.1%), which demonstrate the effectiveness and generalization capacity of the proposed method. We will make all source code and trained models publicly available.

arXiv information

Authors	Xin Li, Deshui Miao, Zhenyu He, Yaowei Wang, Huchuan Lu, Ming-Hsuan Yang
Published	2024-07-10 15:36:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
