A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects

Abstract

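Video Scene Parsing (VSP) has emerged as a cornerstone of computer vision, enabling the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes.
This survey presents a holistic review of recent advances in VSP, covering a wide range of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS).
We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms, from fully convolutional networks to the latest transformer-based architectures, and assess their effectiveness in capturing both local and global temporal context.
Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and provides a comprehensive comparative study of the datasets and evaluation metrics that have shaped current benchmarking standards.
By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and future research directions that promise to further improve the robustness and adaptability of VSP in real-world applications.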

Abstract (original)

Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes. In this survey, we present a holistic review of recent advances in VSP, covering a wide array of vision tasks, including Video Semantic Segmentation (VSS), Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), as well as Video Tracking and Segmentation (VTS), and Open-Vocabulary Video Segmentation (OVVS). We systematically analyze the evolution from traditional hand-crafted features to modern deep learning paradigms — spanning from fully convolutional networks to the latest transformer-based architectures — and assess their effectiveness in capturing both local and global temporal contexts. Furthermore, our review critically discusses the technical challenges, ranging from maintaining temporal consistency to handling complex scene dynamics, and offers a comprehensive comparative study of datasets and evaluation metrics that have shaped current benchmarking standards. By distilling the key contributions and shortcomings of state-of-the-art methodologies, this survey highlights emerging trends and prospective research directions that promise to further elevate the robustness and adaptability of VSP in real-world applications.

arXiv information

Authors	Guohuan Xie, Syed Ariff Syed Hesham, Wenya Guo, Bing Li, Ming-Ming Cheng, Guolei Sun, Yun Liu
Published	2025-06-16 14:39:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
