Abstract
Network convergence and recognition accuracy are both essential when applying Convolutional Neural Networks (CNNs) to human action recognition. Most deep learning methods neglect model convergence while striving to improve abstraction capability, so their performance degrades sharply when computing resources are limited. To mitigate this problem, we propose the 2D Progressive Fusion (2DPF) module, which is inserted after the 2D backbone CNN layers. 2DPF fuses features with a novel 2D convolution over the spatial and temporal dimensions, called variation attenuating convolution, and applies fusion techniques to improve both recognition accuracy and convergence. Experiments on several benchmarks (e.g., Something-Something V1 & V2, Kinetics400 & 600, AViD, UCF101) demonstrate the effectiveness of the proposed method.
ARTICLE INFO.