Abstract
Anomaly detection with weakly supervised video-level labels is typically
formulated as a multiple instance learning (MIL) problem, in which we aim to
identify snippets containing abnormal events, with each video represented as a
bag of video snippets. Although current methods show effective detection
performance, their recognition of the positive instances, i.e., rare abnormal
snippets in the abnormal videos, is largely biased by the dominant negative
instances, especially when the abnormal events are subtle anomalies that
exhibit only small differences compared with normal events. This bias is
exacerbated in many methods that ignore important video temporal dependencies.
To address it, we introduce a novel and theoretically sound method,
named Robust Temporal Feature Magnitude learning (RTFM), which trains a feature
magnitude learning function to effectively recognise the positive instances,
substantially improving the robustness of the MIL approach to the negative
instances from abnormal videos. RTFM also adapts dilated convolutions and
self-attention mechanisms to capture long- and short-range temporal
dependencies to learn the feature magnitude more faithfully. Extensive
experiments show that the RTFM-enabled MIL model (i) outperforms several
state-of-the-art methods by a large margin on four benchmark data sets
(ShanghaiTech, UCF-Crime, XD-Violence and UCSD-Peds) and (ii) achieves
significantly improved subtle anomaly discriminability and sample efficiency.
Code is available at https://github.com/tianyu0207/RTFM.
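
To make the feature magnitude idea concrete, the following is a minimal sketch and not the released implementation. The abstract does not state how snippets are selected or how the loss is formed, so the sketch assumes that feature magnitude is the L2 norm of a snippet feature, that each video is scored by the mean of its k largest snippet magnitudes, and that training uses a hinge-style margin between an abnormal and a normal video. The names `topk_mean_magnitude` and `magnitude_margin_loss`, and the parameters `k` and `margin`, are hypothetical.

```python
import math


def l2_norm(vec):
    """L2 norm of one snippet feature vector."""
    return math.sqrt(sum(x * x for x in vec))


def topk_mean_magnitude(bag, k):
    """Mean of the k largest snippet feature magnitudes in one video (bag).

    Assumption: a video is summarised by its top-k magnitudes, so a few
    abnormal snippets are not averaged away by the many normal ones.
    """
    if not bag:
        raise ValueError("bag must contain at least one snippet feature")
    mags = sorted((l2_norm(f) for f in bag), reverse=True)
    k = max(1, min(k, len(mags)))
    return sum(mags[:k]) / k


def magnitude_margin_loss(abnormal_bag, normal_bag, k=3, margin=1.0):
    """Hinge-style loss (assumed form): zero once the abnormal video's top-k
    magnitude exceeds the normal video's top-k magnitude by `margin`."""
    gap = topk_mean_magnitude(abnormal_bag, k) - topk_mean_magnitude(normal_bag, k)
    return max(0.0, margin - gap)


# Toy bags of 2-D snippet features; only one abnormal snippet has a large norm.
normal_bag = [[0.1, 0.0], [0.0, 0.2], [0.1, 0.1]]
abnormal_bag = [[0.1, 0.0], [3.0, 4.0], [0.0, 0.2]]

loss = magnitude_margin_loss(abnormal_bag, normal_bag, k=1, margin=1.0)
```

With `k=1`, the single high-magnitude snippet in the toy abnormal bag determines that video's score, so the dominant low-magnitude snippets do not dilute it. In practice the features would come from a learned temporal network rather than fixed vectors.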