Abstract
Current methods for detecting polyps in colonoscopy videos are trained
exclusively on normal (i.e., healthy) images; such methods i) ignore the
temporal information in consecutive video frames, and ii) lack knowledge about
polyps. Consequently, they often have high detection errors, especially on
challenging polyp cases (e.g., small, flat, or partially visible polyps). In
this work, we formulate polyp detection as a weakly-supervised anomaly
detection task that uses video-level labelled training data to detect
frame-level polyps. In particular, we propose a novel convolutional
transformer-based multiple instance learning method designed to identify
abnormal frames (i.e., frames with polyps) from anomalous videos (i.e., videos
containing at least one frame with a polyp). In our method, local and global
temporal dependencies are seamlessly captured while we simultaneously optimise
video-level and snippet-level anomaly scores. A contrastive snippet mining method is
also proposed to enable effective modelling of the challenging polyp cases.
The resulting method achieves a detection accuracy that is substantially better
than current state-of-the-art approaches on a new large-scale colonoscopy video
dataset introduced in this work.
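
To make the weakly-supervised setting concrete, the following is a minimal sketch of generic top-k multiple instance learning over snippet scores. It is not the convolutional transformer or the contrastive snippet mining proposed here. It assumes each video has already been reduced to one anomaly score per snippet in [0, 1], and the names `video_score`, `mil_ranking_loss`, and `flag_snippets`, as well as the values of `k`, `margin`, and `threshold`, are hypothetical choices made for illustration.

```python
from typing import List


def video_score(snippet_scores: List[float], k: int = 3) -> float:
    """Video-level score: mean of the k highest snippet scores (assumed aggregation)."""
    if not snippet_scores:
        raise ValueError("need at least one snippet score")
    top_k = sorted(snippet_scores, reverse=True)[:k]
    return sum(top_k) / len(top_k)


def mil_ranking_loss(abnormal: List[float], normal: List[float],
                     k: int = 3, margin: float = 1.0) -> float:
    """Hinge loss that is zero once the abnormal video outscores the normal one by `margin`."""
    return max(0.0, margin - video_score(abnormal, k) + video_score(normal, k))


def flag_snippets(snippet_scores: List[float], threshold: float = 0.5) -> List[int]:
    """Indices of snippets whose score exceeds the (hypothetical) threshold."""
    return [i for i, s in enumerate(snippet_scores) if s > threshold]


# Only video-level labels are used: one video known to contain a polyp, one known to be normal.
abnormal_video = [0.1, 0.9, 0.8, 0.2]
normal_video = [0.1, 0.2, 0.1, 0.0]

loss = mil_ranking_loss(abnormal_video, normal_video, k=2)
flagged = flag_snippets(abnormal_video)
```

Only the video-level label decides which list is treated as abnormal during training, while snippet-level predictions are read directly from the per-snippet scores at test time.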