Abstract
With the tremendous growth in video consumption driven by the proliferation
of social-media applications, video understanding has become a pressing and
challenging concern in computer vision. Temporal action detection (TAD) has
emerged as one of its most crucial and challenging problems, garnering
significant attention in recent years due to its widespread applications in
daily life. Notably, TAD has witnessed substantial progress, particularly with
recent advances in deep learning. There is a growing demand for temporal
action detection in untrimmed videos, since most naturally occurring videos
are untrimmed. For lengthy untrimmed videos, temporal
action detection primarily addresses two tasks: a) determining when the
action occurs, including the start and end times, and b) identifying the
category to which each proposal belongs (such as Waving, Climbing, or
Basketball-Dunk). Given that a video may contain one or more action clips,
temporal action detection aims to develop models and techniques that provide
fundamental information for computer vision applications: What are
the actions, and when do these actions occur? This task is commonly called
action localization, temporal action localization, or action detection. While
both action recognition and action detection are crucial aspects of video
understanding, temporal action detection poses greater challenges than action
recognition. The relationship between action recognition and action
detection mirrors that of object recognition and object detection. However,
due to the inclusion of temporal information, temporal action detection is
a significantly more complex problem than object detection. The difficulties are
as follows: a) Temporal information: because actions unfold along a
one-dimensional temporal sequence, temporal action detection cannot rely on
static image information alone; it must also model temporal dependencies
across frames. b) Unclear boundaries: unlike object detection, where object
boundaries are usually obvious and an explicit bounding box can be drawn,
there is no sensible definition of the exact temporal extent of an action, so
it is difficult to assign precise boundaries for when an action starts and
ends. c) Large temporal spans: the duration of action instances can vary
enormously; for example, waving hands may take only a few seconds, while
climbing or cycling can last tens of minutes. This large variation in span
makes proposal extraction extremely difficult. d) Costly annotation: to
annotate a single video, the annotator must watch the entire video before
marking the start and end points of each action, so obtaining such annotations
is very costly; generalizing to real-world data is therefore a key aspect of
this problem. In addition, open environments introduce further problems such
as multiple scales, multiple targets, and camera movement. Temporal action
detection is closely tied to everyday life: it has extensive application
prospects and social value in video summarization, public video surveillance,
skill assessment, and daily-life security, and has therefore received
considerable attention in recent years. In the following chapters we discuss
the potential drawbacks of existing TAD approaches and how we have attempted
to resolve these shortcomings.
Our first technical chapter (Chapter 3) begins by examining how efficiently we
can detect actions without resorting to the conventional proposal-based
methods. We thus reformulate proposal-based TAD into a proposal-free design.
We start by converting detection from a regression problem into a
classification problem, realized with a 1-D segmentation mask instead of
discrete start/end points. As a result, training is faster and predictions are
more accurate.
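To make the reformulation concrete, below is a minimal illustrative sketch in PyTorch (an assumption-laden simplification, not the thesis implementation): a ground-truth segment is converted into a binary 1-D snippet mask, and the detector is trained with a per-snippet classification loss instead of start/end regression.

```python
# Illustrative sketch only: TAD recast as per-snippet mask classification
# rather than start/end regression. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def segment_to_mask(start, end, num_snippets):
    """Convert a (start, end) segment, given in snippet units, to a binary 1-D mask."""
    idx = torch.arange(num_snippets)
    return ((idx >= start) & (idx < end)).float()

def mask_loss(pred_logits, start, end):
    """Binary cross-entropy between per-snippet foreground logits and the mask target."""
    target = segment_to_mask(start, end, pred_logits.shape[-1])
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# Example: an action spanning snippets [25, 60) in a 100-snippet video.
logits = torch.randn(100)          # per-snippet foreground scores from some backbone
loss = mask_loss(logits, 25, 60)
```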
The second technical chapter (Chapter 4) identifies the lack of labelled
samples as a bottleneck for TAD performance. This is largely due to an
ever-present localization error propagation issue within such networks, which
is even more prevalent in the semi-supervised setting where only partial
labels are available. The problem stems from the model design: all existing
approaches follow a localize-then-classify paradigm for TAD. We therefore
introduce a single-stage design that mitigates the localization error and
decouples the subtasks. We reuse the unlabelled samples to pretrain the
overall network with novel pretext tasks for a better model prior, and
fine-tune with the labelled samples.
Thirdly, in Chapter 5 we extend the second contribution to the more
challenging few-shot TAD setting, where only a few annotated videos per class
are available. One challenging aspect in this scenario is intra-class
variance, where the same class exhibits diverse foreground frame distributions
within the support set. Traditional few-shot TAD fails to handle this
intra-class variance and consequently performs far worse than even
semi-supervised TAD. We propose a transformer-based solution that addresses
this intra-class variation with the help of self-attention. In addition, we
investigate the role of multi-modal input (e.g., video and text) in the
support set, and observe that the additional text modality helps compensate
for data scarcity and further reduces intra-class variance, thanks to
large-scale pretraining. We realize this by meta-learning a novel
video-to-text inversion module that bridges the vanilla and multi-modal
few-shot setups.
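As a purely hypothetical illustration of the inversion idea (module name, dimensions, and pooling choice are our assumptions, not the thesis design), such a module can be sketched as a small projection network that maps pooled support-video features into the text-embedding space, where they act as pseudo-tokens for a pretrained text encoder.

```python
# Hypothetical sketch of a video-to-text inversion module; all sizes are assumed.
import torch
import torch.nn as nn

class VideoToTextInversion(nn.Module):
    def __init__(self, video_dim=2048, text_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, support_feats):
        # support_feats: (num_support, T, video_dim) features of few-shot support videos
        pooled = support_feats.mean(dim=1)   # temporal average pooling per support video
        return self.proj(pooled)             # (num_support, text_dim) pseudo-text tokens

pseudo_tokens = VideoToTextInversion()(torch.randn(5, 64, 2048))
```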
Finally, in our last chapter (Chapter 6), we extend the TAD problem to the
zero-shot open-vocabulary setup. We notice that the localization error
propagation problem identified in Chapter 4 also extends to the zero-shot
setting, where the error is significant due to the lack of any labelled
samples. This lack of labelled samples can be compensated for by using
large-scale pretrained vision-language models such as CLIP. The base design of
our network is a single-stage TAD model. However, CLIP was never designed for
dense detection tasks in videos, so we propose a representation-masking
concept that masks out the foreground and aligns it with the CLIP embeddings.
The class-agnostic representation masks additionally make the model generalize
better to unseen classes. As a result, our proposed design achieves
state-of-the-art results in both open-set and closed-set TAD settings.
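The alignment idea can be sketched roughly as follows (an illustrative approximation under stated assumptions, not the exact proposed architecture): a class-agnostic foreground mask weights the snippet features, and the pooled masked representation is scored against precomputed CLIP text embeddings by cosine similarity, which extends naturally to unseen class prompts.

```python
# Rough sketch: class-agnostic foreground masking followed by cosine alignment
# with precomputed CLIP text embeddings. Feature dimensions are assumptions.
import torch
import torch.nn.functional as F

def masked_clip_scores(snippet_feats, fg_mask_logits, text_embeds):
    """
    snippet_feats: (T, D) per-snippet video features projected into CLIP's space
    fg_mask_logits: (T,) class-agnostic foreground logits
    text_embeds: (C, D) CLIP text embeddings for C (possibly unseen) class prompts
    """
    mask = torch.sigmoid(fg_mask_logits)                                  # soft foreground mask
    fg_feat = (mask.unsqueeze(-1) * snippet_feats).sum(0) / mask.sum().clamp(min=1e-6)
    fg_feat = F.normalize(fg_feat, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return fg_feat @ text_embeds.t()                                      # (C,) similarity scores

scores = masked_clip_scores(torch.randn(100, 512), torch.randn(100), torch.randn(20, 512))
```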