Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization

Abstract

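Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances using only category labels.
Most methods adopt off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization.
However, because classification and localization have different optimization objectives, the temporally localized results suffer from serious incompleteness.
To tackle this issue without additional annotations, this paper considers distilling free action knowledge from Vision-Language Pre-training (VLP), motivated by the observation that the localization results of vanilla VLP are over-complete, which is complementary to the CBP results.
To fuse this complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP, respectively.
The framework is optimized through a dual-branch alternate training strategy.
Specifically, during the B step, confident background pseudo-labels are distilled from the CBP branch.
During the F step, confident foreground pseudo-labels are distilled from the VLP branch.
As a result, the complementarity of the two branches is effectively fused, promoting a strong alliance.
Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 show that our method significantly outperforms state-of-the-art methods.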
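
The alternating B/F scheme described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function names, the per-frame score lists, and the thresholds (0.2 and 0.8) are all hypothetical, chosen only to illustrate the idea of keeping confident pseudo-labels and leaving uncertain frames unlabeled.

```python
from typing import List, Optional


def confident_background(cbp_scores: List[float], thresh: float = 0.2) -> List[Optional[int]]:
    """B step (sketch): frames the CBP branch scores below `thresh` become
    background pseudo-labels (0); all other frames stay unlabeled (None).
    The threshold value is an assumption, not taken from the paper."""
    return [0 if s < thresh else None for s in cbp_scores]


def confident_foreground(vlp_scores: List[float], thresh: float = 0.8) -> List[Optional[int]]:
    """F step (sketch): frames the VLP branch scores above `thresh` become
    foreground pseudo-labels (1); all other frames stay unlabeled (None).
    The threshold value is an assumption, not taken from the paper."""
    return [1 if s > thresh else None for s in vlp_scores]


def alternate_schedule(num_rounds: int) -> List[str]:
    """Alternate the two steps, starting with B (the starting step is an assumption)."""
    return ["B" if i % 2 == 0 else "F" for i in range(num_rounds)]


if __name__ == "__main__":
    # Hypothetical per-frame actionness scores for a five-frame clip.
    cbp = [0.05, 0.10, 0.55, 0.90, 0.15]
    vlp = [0.30, 0.85, 0.95, 0.92, 0.40]
    for step in alternate_schedule(4):
        labels = confident_background(cbp) if step == "B" else confident_foreground(vlp)
        print(step, labels)
```

In a full system each step would presumably use its pseudo-labels as a training signal; the abstract does not spell out those details, so the sketch stops at label extraction.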

Abstract (original)

Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), since we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. And as a result, the dual-branch complementarity is effectively fused to promote a strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods.

arXiv information

Authors	Chen Ju, Kunhao Zheng, Jinxiang Liu, Peisen Zhao, Ya Zhang, Jianlong Chang, Yanfeng Wang, Qi Tian
Published	2022-12-19 10:02:50+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
