Abstract
Contrastive learning has achieved great success in skeleton-based action
recognition. However, most existing approaches encode skeleton sequences as
entangled spatiotemporal representations and confine the contrasts to
representations at the same level. In contrast, this paper introduces a novel
contrastive learning framework, the Spatiotemporal Clues Disentanglement
Network (SCD-Net). Specifically, we integrate a decoupling module with a
feature extractor to derive explicit clues from the spatial and temporal
domains, respectively. To train SCD-Net, we construct a global anchor and
encourage interaction between the anchor and the extracted clues. Further,
we propose a new masking strategy with structural constraints to strengthen
the contextual associations, extending the latest advances in masked image
modelling to the proposed SCD-Net. We conduct extensive evaluations on the
NTU-RGB+D (60&120) and PKU-MMD (I&II) datasets, covering various downstream
tasks such as action recognition, action retrieval, transfer learning, and
semi-supervised learning. The experimental results demonstrate the
effectiveness of our method, which significantly outperforms existing
state-of-the-art (SOTA) approaches.
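
To make the cross-level contrast concrete, the sketch below illustrates how a
global anchor embedding could be contrasted against the disentangled spatial
and temporal clues with an InfoNCE-style objective. This is a minimal,
hypothetical example: the function name, tensor shapes, temperature value, and
loss form are assumptions for exposition, not the authors' released
implementation.

# Illustrative sketch only; names, shapes and the InfoNCE-style loss are
# assumptions, not the paper's exact formulation.
import torch
import torch.nn.functional as F

def cross_level_contrast(anchor, spatial_clue, temporal_clue, temperature=0.07):
    """Contrast a global anchor against disentangled spatial/temporal clues.

    anchor:        (B, D) global sequence-level embedding
    spatial_clue:  (B, D) embedding from the spatial decoupling branch
    temporal_clue: (B, D) embedding from the temporal decoupling branch
    """
    loss = 0.0
    for clue in (spatial_clue, temporal_clue):
        a = F.normalize(anchor, dim=-1)
        c = F.normalize(clue, dim=-1)
        logits = a @ c.t() / temperature               # (B, B) similarity matrix
        targets = torch.arange(a.size(0), device=a.device)
        loss = loss + F.cross_entropy(logits, targets) # positives on the diagonal
    return loss / 2

# Toy usage with random embeddings standing in for encoder outputs.
B, D = 8, 128
loss = cross_level_contrast(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))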