Abstract
Self-supervised learning (SSL) techniques have recently produced outstanding
results in learning visual representations from unlabeled videos. Despite the
importance of motion in supervised learning techniques for action recognition,
SSL methods often do not explicitly consider motion information in videos. To
address this issue, we propose MOFO (MOtion FOcused), a novel SSL method for
focusing representation learning on the motion area of a video, for action
recognition. MOFO automatically detects motion areas in videos and uses these
to guide the self-supervision task. We use a masked autoencoder which randomly
masks out a high proportion of the input sequence; we force a specified
percentage of the inside of the motion area to be masked and the remainder from
outside. We further incorporate motion information into the finetuning step to
emphasise motion in the downstream task. We demonstrate that our motion-focused
innovations can significantly boost the performance of the currently leading
SSL method (VideoMAE) for action recognition. Our method improves the recent
self-supervised Vision Transformer (ViT), VideoMAE, by achieving +2.6%, +2.1%,
+1.3% accuracy on Epic-Kitchens verb, noun and action classification,
respectively, and +4.7% accuracy on Something-Something V2 action
classification. Our proposed approach significantly improves the performance of
the current SSL method for action recognition, indicating the importance of
explicitly encoding motion in SSL.