Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

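Summary

Referring video segmentation relies on natural language expressions to identify and segment objects, and these expressions often emphasize motion cues. Previous work treats the sentence as a whole and performs identification directly at the video level, mixing static image-level cues with temporal motion cues. However, image-level features cannot adequately capture the motion cues in a sentence, and static cues are not essential for temporal perception. In fact, static cues can hinder temporal perception by overshadowing motion cues. This work proposes decoupling video-level referring expression understanding into static perception and motion perception, with particular emphasis on strengthening temporal understanding. First, an expression-decoupling module is introduced so that static cues and motion cues each play their own role, which mitigates the problem of sentence embeddings overlooking motion cues. Second, a hierarchical motion perception module is proposed to capture temporal information effectively at various timescales. In addition, contrastive learning is adopted to distinguish the motions of visually similar objects. Together, these contributions achieve state-of-the-art performance on five datasets, including a notable 9.2% $\mathcal{J\&F}$ improvement on the challenging MeViS dataset. Code is available at https://github.com/heshuting555/DsHmp.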

Abstract (original)

Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video-level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to make static cues and motion cues perform their distinct role, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable $\textbf{9.2%}$ $\mathcal{J\&F}$ improvement on the challenging $\textbf{MeViS}$ dataset. Code is available at https://github.com/heshuting555/DsHmp.
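
The abstract mentions contrastive learning for telling apart the motions of visually similar objects, but it does not say which loss is used. As a rough illustration only, the sketch below implements a generic InfoNCE-style contrastive loss in plain Python. The function names, the toy feature vectors, and the temperature value are all assumptions made for this example and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: negative log-softmax score of the positive.

    This is a textbook formulation, not the loss confirmed by the paper.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical toy features: a motion query from the expression, the motion
# feature of the referred object, and that of a look-alike distractor.
motion_query = [1.0, 0.0, 0.2]
target_motion = [0.9, 0.1, 0.2]
distractor_motion = [0.0, 1.0, 0.2]

loss_matched = info_nce(motion_query, target_motion, [distractor_motion])
loss_mismatched = info_nce(motion_query, distractor_motion, [target_motion])
print(loss_matched < loss_mismatched)
```

The loss is small when the query is closer to the matching motion feature than to the distractor, and large in the opposite case, which is the behaviour a motion-discriminating objective would need.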

arXiv information

Authors	Shuting He, Henghui Ding
Published	2024-04-04 17:58:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
