Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action Localization

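Summary

The vocabulary size in temporal action localization (TAL) is limited by the lack of large-scale annotated datasets.
To address this, recent work incorporates powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL).
However, unlike VLMs, which are trained on large collections of image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets to train the action localizer.
This paper investigates how well self-training with unlabeled YouTube videos scales for OV-TAL.
The self-training approach consists of two stages.
First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos.
Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer.
Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly improves the generalizability of the action localizer.
In addition, the paper highlights problems with existing OV-TAL evaluation schemes and proposes a new evaluation protocol.
Code is available at https://github.com/HYUNJS/STOV-TAL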
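
The two-stage procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); the `Segment` type, the function names, the `min_score` confidence filter, and the toy localizer are all hypothetical and exist only to show the data flow of pseudo-labeling followed by merging with human labels.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Segment:
    """A class-agnostic action proposal (hypothetical structure)."""
    start: float  # seconds
    end: float    # seconds
    score: float  # confidence that the span contains some action


# A localizer maps a video id to its predicted segments (assumed interface).
Localizer = Callable[[str], List[Segment]]


def generate_pseudo_labels(
    localizer: Localizer,
    unlabeled_videos: List[str],
    min_score: float = 0.5,  # assumed filter; the abstract does not specify one
) -> Dict[str, List[Segment]]:
    """Stage 1: run a trained localizer on unlabeled videos, keep confident segments."""
    pseudo: Dict[str, List[Segment]] = {}
    for video_id in unlabeled_videos:
        kept = [s for s in localizer(video_id) if s.score >= min_score]
        if kept:  # skip videos with no confident prediction
            pseudo[video_id] = kept
    return pseudo


def build_training_set(
    human_labeled: Dict[str, List[Segment]],
    pseudo_labeled: Dict[str, List[Segment]],
) -> Dict[str, List[Segment]]:
    """Stage 2 input: union of both label sets; human labels win on overlap."""
    combined = dict(pseudo_labeled)
    combined.update(human_labeled)
    return combined


def toy_localizer(video_id: str) -> List[Segment]:
    """Stand-in for a localizer already trained on human-labeled data."""
    fake_predictions = {
        "yt_a": [Segment(1.0, 4.0, 0.9), Segment(6.0, 7.0, 0.2)],
        "yt_b": [Segment(0.0, 2.0, 0.3)],
    }
    return fake_predictions.get(video_id, [])


human = {"tal_1": [Segment(2.0, 5.0, 1.0)]}
pseudo = generate_pseudo_labels(toy_localizer, ["yt_a", "yt_b"], min_score=0.5)
training_set = build_training_set(human, pseudo)
```

In this sketch the localizer would then be trained on `training_set`; how the pseudo-labels are actually filtered and weighted during training is not stated in the summary.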

Abstract (original)

The vocabulary size in temporal action localization (TAL) is constrained by the scarcity of large-scale annotated datasets. To address this, recent works incorporate powerful pre-trained vision-language models (VLMs), such as CLIP, to perform open-vocabulary TAL (OV-TAL). However, unlike VLMs trained on extensive image/video-text pairs, existing OV-TAL methods still rely on small, fully labeled TAL datasets for training an action localizer. In this paper, we explore the scalability of self-training with unlabeled YouTube videos for OV-TAL. Our self-training approach consists of two stages. First, a class-agnostic action localizer is trained on a human-labeled TAL dataset and used to generate pseudo-labels for unlabeled videos. Second, the large-scale pseudo-labeled dataset is combined with the human-labeled dataset to train the localizer. Extensive experiments demonstrate that leveraging web-scale videos in self-training significantly enhances the generalizability of an action localizer. Additionally, we highlighted issues with existing OV-TAL evaluation schemes and proposed a new evaluation protocol. Code is released at https://github.com/HYUNJS/STOV-TAL

arXiv information

Authors	Jeongseok Hyun, Su Ho Han, Hyolim Kang, Joon-Young Lee, Seon Joo Kim
Published	2024-07-09 16:44:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
