End-to-End Semi-Supervised Learning for Video Action Detection

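Abstract

This work addresses semi-supervised learning for video action detection, using both labeled and unlabeled data. We propose a simple, end-to-end, consistency-based approach that makes effective use of unlabeled data. Video action detection requires both predicting the action class and localizing the action in space and time. We therefore investigate two types of constraints: classification consistency and spatio-temporal consistency. Because videos contain large background and static regions, it is difficult to use spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency and 2) gradient smoothness. Both exploit the temporal continuity of actions in video and prove effective when unlabeled videos are used for action detection. We demonstrate the effectiveness of the proposed approach on two action detection benchmark datasets, UCF101-24 and JHMDB-21. We also show its effectiveness for video object segmentation on Youtube-VOS, which indicates that the approach generalizes. Using only 20% of the annotations on UCF101-24, the proposed approach achieves performance competitive with recent fully supervised methods. On UCF101-24, it improves f-mAP and v-mAP at 0.5 by +8.9% and +11%, respectively, over the supervised approach.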

Abstract (original)

In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach which effectively utilizes the unlabeled data. Video action detection requires both action class prediction and spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency, and 2) gradient smoothness. Both these aspects exploit the temporal continuity of action in videos and are found to be effective for utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we also show the effectiveness of the proposed approach for video object segmentation on Youtube-VOS, which demonstrates its generalization capability. The proposed approach achieves competitive performance by using merely 20% of annotations on UCF101-24 when compared with recent fully supervised methods. On UCF101-24, it improves the score by +8.9% and +11% at 0.5 f-mAP and v-mAP respectively, compared to the supervised approach.
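
The abstract names the constraints but gives no formulas, so the following is only an illustrative sketch of how a consistency objective on an unlabeled video could be assembled, not the authors' formulation. It assumes two augmented views of the same clip, class probabilities as flat lists, and localization maps as a list of frames, each frame being a flat list of per-pixel probabilities. All function names, the mean-squared-error choice, and the `weight` parameter are hypothetical. The `temporal_difference_penalty` term is a plain first-order smoothness penalty used as a stand-in for a constraint that exploits temporal continuity; it is not the paper's temporal coherency or gradient smoothness term.

```python
def mse(a, b):
    """Mean squared difference between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def flatten(maps):
    """Flatten a list of frames (each a flat list of pixels) into one list."""
    return [pixel for frame in maps for pixel in frame]


def classification_consistency(probs_a, probs_b):
    """Assumed form: MSE between class probabilities of the two views."""
    return mse(probs_a, probs_b)


def spatiotemporal_consistency(maps_a, maps_b):
    """Assumed form: per-pixel MSE between the two views' localization maps."""
    return mse(flatten(maps_a), flatten(maps_b))


def temporal_difference_penalty(maps):
    """Stand-in regularizer: mean squared change between consecutive frames."""
    if len(maps) < 2:
        raise ValueError("need at least two frames")
    diffs = [mse(prev, nxt) for prev, nxt in zip(maps, maps[1:])]
    return sum(diffs) / len(diffs)


def unlabeled_loss(probs_a, probs_b, maps_a, maps_b, weight=0.1):
    """Combine the three hypothetical terms into one scalar loss."""
    smooth = 0.5 * (temporal_difference_penalty(maps_a)
                    + temporal_difference_penalty(maps_b))
    return (classification_consistency(probs_a, probs_b)
            + spatiotemporal_consistency(maps_a, maps_b)
            + weight * smooth)


if __name__ == "__main__":
    # Two views of a 3-frame clip with 4 pixels per frame (toy values).
    view_a = [[0.1, 0.9, 0.8, 0.0], [0.1, 0.8, 0.9, 0.0], [0.0, 0.7, 0.9, 0.1]]
    view_b = [[0.2, 0.8, 0.8, 0.0], [0.1, 0.9, 0.8, 0.1], [0.0, 0.8, 0.9, 0.0]]
    print(unlabeled_loss([0.7, 0.3], [0.6, 0.4], view_a, view_b))
```

In a semi-supervised setup of the kind the abstract describes, a term like this would be computed on unlabeled clips and added to the ordinary supervised loss on labeled clips; how the two are weighted is not stated in the abstract.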

arXiv information

Authors	Akash Kumar, Yogesh Singh Rawat
Published	2022-07-01 05:36:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
