Abstract
Existing Temporal Action Detection (TAD) methods typically take a
pre-processing step in converting an input varying-length video into a
fixed-length snippet representation sequence, before temporal boundary
estimation and action classification. This pre-processing step would temporally
downsample the video, reducing the inference resolution and hampering the
detection performance in the original temporal resolution. In essence, this is
due to a temporal quantization error introduced during the resolution
downsampling and recovery. This could negatively impact the TAD performance,
but is largely ignored by existing methods. To address this problem, in this
work we introduce a novel model-agnostic post-processing method without model
redesign and retraining. Specifically, we model the start and end points of
action instances with a Gaussian distribution for enabling temporal boundary
inference at a sub-snippet level. We further introduce an efficient
Taylor-expansion based approximation, dubbed as Gaussian Approximated
Post-processing (GAP). Extensive experiments demonstrate that our GAP can
consistently improve a wide variety of pre-trained off-the-shelf TAD models on
the challenging ActivityNet (+0.2% -0.7% in average mAP) and THUMOS (+0.2%
-0.5% in average mAP) benchmarks. Such performance gains are already
significant and highly comparable to those achieved by novel model designs.
Also, GAP can be integrated with model training for further performance gain.
Importantly, GAP enables lower temporal resolutions for more efficient
inference, facilitating low-resource applications. The code will be available
in https://github.com/sauradip/GAP