Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

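Summary

Skeleton-based action recognition is a central task in human-computer interaction.
However, most previous methods suffer from two problems: (i) semantic ambiguity caused by mixing spatial and temporal information; and
(ii) no explicit use of the latent data distribution (that is, intra-class variation and inter-class relations), which leaves the skeleton encoder stuck in a local optimum.
To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework that obtains discriminative, semantically distinct representations from skeleton sequences. It can be incorporated into almost all previous skeleton encoders and has no impact on the encoder at test time.
Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatiotemporal coupling of the features.
Furthermore, to exploit the latent data distribution explicitly, we use attentive features for contrastive learning, which models cross-sequence semantic relations by pulling together features from positive pairs and pushing apart negative pairs.
Extensive experiments show that STD-CL with four different skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on the NTU60, NTU120, and NW-UCLA benchmarks.
The code will be released.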
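
To make the decoupling step concrete, here is a minimal sketch of one way a global feature map could be split into a spatial-specific and a temporal-specific view. It is an illustration under assumptions, not the authors' implementation: the `[channel][frame][joint]` layout, the function name `decouple`, and the use of plain mean pooling are all hypothetical, and the summary indicates that the actual method uses attentive features rather than a simple average.

```python
from statistics import fmean


def decouple(features):
    """Split a feature map indexed [channel][frame][joint] into two views.

    Assumption: plain mean pooling stands in for whatever attentive
    pooling the actual method applies.
    """
    # Spatial-specific view: average over frames, one value per joint.
    spatial = [[fmean(joint_over_time) for joint_over_time in zip(*channel)]
               for channel in features]
    # Temporal-specific view: average over joints, one value per frame.
    temporal = [[fmean(frame) for frame in channel] for channel in features]
    return spatial, temporal


# Toy input: 1 channel, 2 frames, 3 joints.
toy = [[[1.0, 2.0, 3.0],
        [3.0, 4.0, 5.0]]]
spatial, temporal = decouple(toy)
print(spatial)   # one row per channel, one entry per joint
print(temporal)  # one row per channel, one entry per frame
```

Pooling over one axis at a time is what keeps the two views separate: the spatial view no longer depends on frame order, and the temporal view no longer depends on which joint moved.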

Abstract (original)

Skeleton-based action recognition is a central task of human-computer interaction. However, most of the previous methods suffer from two issues: (i) semantic ambiguity arising from spatiotemporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to local optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into almost all previous skeleton encoders and have no impact on the skeleton encoders when testing. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatiotemporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we employ the attentive features to contrastive learning, which models the cross-sequence semantic relations by pulling together the features from the positive pairs and pushing away the negative pairs. Extensive experiments show that STD-CL with four various skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvement on NTU60, NTU120, and NW-UCLA benchmarks. The code will be released.
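
The contrastive objective described in the abstract, which pulls positive pairs together and pushes negative pairs apart, can be sketched with a generic supervised InfoNCE-style loss. This is a swapped-in textbook formulation, not the loss from the paper: the function names, the cosine similarity, the `temperature` value, and the choice of "same label" as the definition of a positive pair are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(features, labels, temperature=0.1):
    """Mean supervised InfoNCE-style loss over anchors that have a positive.

    Assumption: sequences with the same label form positive pairs and
    all other sequences in the batch act as negatives.
    """
    losses = []
    for i, anchor in enumerate(features):
        others = [j for j in range(len(features)) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        logits = {j: cosine(anchor, features[j]) / temperature for j in others}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        # Average negative log-probability of picking a positive.
        losses.append(-sum(logits[j] - log_denom for j in positives)
                      / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


# Toy batch: two tight clusters of 2-D features.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matched = contrastive_loss(feats, [0, 0, 1, 1])     # labels follow clusters
loss_mismatched = contrastive_loss(feats, [0, 1, 0, 1])  # labels cut across clusters
print(loss_matched < loss_mismatched)
```

The loss is small when features of the same class already sit close together and large when they do not, which is the pressure that shapes intra-class variation and inter-class relations during training. Because a loss of this kind is used only while training, it is consistent with the abstract's claim that the encoder is unaffected at test time.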

arXiv information

Authors	Shaojie Zhang,Jianqin Yin,Yonghao Dang
Published	2024-01-09 08:59:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
