Abstract
We present PAT, a transformer-based network that learns complex temporal
co-occurrence action dependencies in a video by exploiting multi-scale temporal
features. In existing methods, the self-attention mechanism in transformers
loses the temporal positional information, which is essential for robust action
detection. To address this issue, we (i) embed relative positional encoding in
the self-attention mechanism and (ii) exploit multi-scale temporal
relationships by designing a novel non-hierarchical network, in contrast to
recent transformer-based approaches that use a hierarchical structure. We argue
that combining the self-attention mechanism with multiple sub-sampling
operations in hierarchical approaches compounds the loss of positional
information.
We evaluate our proposed approach on two challenging dense multi-label
benchmark datasets, and show that PAT improves the current state-of-the-art
result by 1.1% and 0.6% mAP on the Charades and MultiTHUMOS datasets,
respectively, achieving new state-of-the-art mAP scores of 26.5% and 44.6%.
We also perform extensive ablation
studies to examine the impact of the different components of our proposed
network.
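To make the first component concrete, the sketch below shows one common way to embed relative positional information in self-attention: a learned bias indexed by the temporal offset i - j is added to the attention logits before the softmax, so the scores stay sensitive to temporal order. This is a minimal illustrative sketch using the learned-bias formulation; the class name, single-head design, and bias parameterization are our assumptions, not PAT's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelPosSelfAttention(nn.Module):
    """Single-head self-attention with a learned relative position bias.

    Hypothetical sketch: one learnable scalar bias per relative offset
    i - j is added to the attention logits, a common formulation for
    relative positional encoding (not PAT's exact design).
    """

    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, dim) -- T temporal feature steps, T <= max_len.
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Content-based attention logits: (batch, T, T).
        logits = torch.einsum("btd,bsd->bts", q, k) * self.scale
        # Relative offsets i - j, shifted to index into rel_bias.
        idx = torch.arange(T, device=x.device)
        offsets = idx[:, None] - idx[None, :]
        logits = logits + self.rel_bias[offsets + T - 1]
        attn = F.softmax(logits, dim=-1)
        out = torch.einsum("bts,bsd->btd", attn, v)
        return self.proj(out)

# Usage: attend over 128 temporal steps of 256-dim features.
layer = RelPosSelfAttention(dim=256, max_len=512)
y = layer(torch.randn(2, 128, 256))  # -> (2, 128, 256)
```

Because the bias depends only on the offset i - j and never on absolute positions, the same parameters apply at every temporal location, which is what lets the attention scores retain ordering information without a fixed absolute encoding.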