Abstract
Spatiotemporal modeling is crucial for capturing motion information in videos for the action recognition task. Despite the promising progress in skeleton-based action recognition achieved by graph convolutional networks (GCNs), the relative improvement from applying classical attention mechanisms has been limited. In this paper, we underline the importance of spatiotemporal interactions by proposing different categories of attention modules. First, we provide insight into two attention modules, the Spatial-wise Attention Module (SAM) and the Temporal-wise Attention Module (TAM), which model contextual interdependencies in the spatial and temporal dimensions, respectively. We then propose the Spatiotemporal Attention Module (STAM), which explicitly leverages comprehensive dependency information through a feature fusion structure embedded in the framework, unlike other action recognition models that rely on additional information flows or complicated superpositions of multiple existing attention modules. Given intermediate feature maps, STAM simultaneously infers feature descriptors along the spatial and temporal dimensions; the fusion of these descriptors then filters the input feature maps for adaptive feature refinement. Experimental results on the NTU RGB+D and Kinetics-Skeleton datasets show consistent improvements in classification performance, demonstrating the merit and wide applicability of STAM.
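The descriptor-fusion idea described above can be illustrated with a minimal NumPy sketch. Note this is a simplified illustration, not the paper's implementation: the pooling operations, the additive fusion, and the function name `stam_refine` are all assumptions, and a real module would use learned transformations rather than plain means.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stam_refine(x):
    """Hypothetical sketch of spatiotemporal attention fusion.

    x: skeleton feature maps of shape (C, T, V) -- channels, frames, joints.
    Returns feature maps of the same shape, rescaled by a fused
    spatial-temporal attention map.
    """
    # Spatial descriptor: pool over channels and time -> one score per joint, shape (V,)
    spatial_desc = x.mean(axis=(0, 1))
    # Temporal descriptor: pool over channels and joints -> one score per frame, shape (T,)
    temporal_desc = x.mean(axis=(0, 2))
    # Fuse the two descriptors into a joint (T, V) attention map via broadcasting
    attn = sigmoid(temporal_desc[:, None] + spatial_desc[None, :])
    # Filter the input feature maps for adaptive feature refinement
    return x * attn[None, :, :]

x = np.random.randn(64, 30, 25)  # e.g. 64 channels, 30 frames, 25 skeleton joints
y = stam_refine(x)
assert y.shape == x.shape
```

Because the spatial and temporal descriptors are inferred from the same intermediate feature maps and then fused, the resulting attention map gates every (frame, joint) location jointly, rather than attending to the two dimensions in isolation.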