Abstract
With the tremendous growth in video consumption driven by the proliferation
of social-media applications, video understanding has become a pressing and
challenging concern in computer vision. Temporal action detection (TAD) has
emerged as one of its most crucial and challenging problems, garnering
significant attention in recent years due to its widespread applications in
daily life. Notably, TAD has witnessed substantial progress, particularly with
recent advances in deep learning. There is a growing demand for temporal
action detection in untrimmed videos, since most naturally occurring videos
are untrimmed. For lengthy untrimmed videos, temporal
action detection primarily addresses two tasks: a) determining when the
action occurs, including the start and end times, and b) identifying the
category to which each proposal belongs (such as Waving, Climbing, or
Basketball-Dunk). Given that a video may contain one or more action clips,
temporal action detection aims to develop models and techniques that provide
fundamental information for computer vision applications: What are
the actions, and when do these actions occur? This task is commonly called
action localization, temporal action localization, or action detection. While
both action recognition and action detection are crucial aspects of video
understanding, temporal action detection poses greater challenges than action
recognition. The relationship between action recognition and action
detection mirrors that of object recognition and object detection. However,
due to the inclusion of temporal information, temporal action detection is
a significantly more complex problem than object detection. The difficulties are
as follows: a) Temporal information: because actions unfold along a
one-dimensional temporal sequence, temporal action detection cannot rely on
static image information alone; it must also model temporal dependencies
across frames. b) Unclear boundaries: unlike object detection, where object
boundaries are usually obvious and an explicit bounding box can be drawn,
there is no sensible definition of the exact temporal extent of an action, so
it is difficult to assign precise boundaries for when an action starts and
ends. c) Large temporal spans: the duration of action instances can vary
enormously; for example, waving hands may take only a few seconds, while
climbing or cycling can last tens of minutes. This large variation in span
makes proposal extraction extremely difficult. d) Costly annotation: to
annotate a single video, the annotator must watch the entire video before
marking the start and end points of each action, so obtaining such annotations
is very costly; generalizing to real-world data is therefore a key aspect of
this problem. In addition, open environments introduce further problems such
as multiple scales, multiple targets, and camera movement. Temporal action
detection is closely tied to everyday life: it has extensive application
prospects and social value in video summarization, public video surveillance,
skill assessment, and daily-life security, and has therefore received
considerable attention in recent years. In the following chapters we discuss
the potential drawbacks of existing TAD approaches and how we have attempted
to resolve these shortcomings.
Our first technical chapter (Chapter 3) begins by examining how efficiently we
can detect actions without resorting to the conventional proposal-based
methods. We thus reformulate proposal-based TAD into a proposal-free design.
We start by converting detection from a regression problem into a
classification problem, realized with a 1-D segmentation mask instead of
discrete start/end points. As a result, training is faster and predictions are
more accurate.
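To make the reformulation concrete, below is a minimal illustrative sketch in PyTorch (an assumption-laden simplification, not the thesis implementation): a ground-truth segment is converted into a binary 1-D snippet mask, and the detector is trained with a per-snippet classification loss instead of start/end regression.

```python
# Illustrative sketch only: TAD recast as per-snippet mask classification
# rather than start/end regression. Shapes and names are assumptions.
import torch
import torch.nn.functional as F

def segment_to_mask(start, end, num_snippets):
    """Convert a (start, end) segment, given in snippet units, to a binary 1-D mask."""
    idx = torch.arange(num_snippets)
    return ((idx >= start) & (idx < end)).float()

def mask_loss(pred_logits, start, end):
    """Binary cross-entropy between per-snippet foreground logits and the mask target."""
    target = segment_to_mask(start, end, pred_logits.shape[-1])
    return F.binary_cross_entropy_with_logits(pred_logits, target)

# Example: an action spanning snippets [25, 60) in a 100-snippet video.
logits = torch.randn(100)          # per-snippet foreground scores from some backbone
loss = mask_loss(logits, 25, 60)
```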
The second technical chapter (Chapter 4) identifies the lack of labelled
samples as a bottleneck for TAD performance. This is largely due to an
ever-present localization error propagation issue within such networks, which
is even more prevalent in the semi-supervised setting where only partial
labels are available. The problem stems from the model design: all existing
approaches follow a localize-then-classify paradigm for TAD. We therefore
introduce a single-stage design that mitigates the localization error and
decouples the subtasks. We reuse the unlabelled samples to pretrain the
overall network with novel pretext tasks for a better model prior, and
fine-tune with the labelled samples.
Thirdly, in Chapter 5 we extend the second contribution to the more
challenging few-shot TAD setting, where only a few annotated videos per class
are available. One challenging aspect in this scenario is intra-class
variance, where the same class exhibits diverse foreground frame distributions
within the support set. Traditional few-shot TAD fails to handle this
intra-class variance and consequently performs far worse than even
semi-supervised TAD. We propose a transformer-based solution that addresses
this intra-class variation with the help of self-attention. In addition, we
investigate the role of multi-modal input (e.g., video and text) in the
support set, and observe that the additional text modality helps compensate
for data scarcity and further reduces intra-class variance, thanks to
large-scale pretraining. We realize this by meta-learning a novel
video-to-text inversion module that bridges the vanilla and multi-modal
few-shot setups.
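As a purely hypothetical illustration of the inversion idea (module name, dimensions, and pooling choice are our assumptions, not the thesis design), such a module can be sketched as a small projection network that maps pooled support-video features into the text-embedding space, where they act as pseudo-tokens for a pretrained text encoder.

```python
# Hypothetical sketch of a video-to-text inversion module; all sizes are assumed.
import torch
import torch.nn as nn

class VideoToTextInversion(nn.Module):
    def __init__(self, video_dim=2048, text_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(video_dim, text_dim),
            nn.ReLU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, support_feats):
        # support_feats: (num_support, T, video_dim) features of few-shot support videos
        pooled = support_feats.mean(dim=1)   # temporal average pooling per support video
        return self.proj(pooled)             # (num_support, text_dim) pseudo-text tokens

pseudo_tokens = VideoToTextInversion()(torch.randn(5, 64, 2048))
```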
Finally, in our last chapter (Chapter 6), we extend the TAD problem to the
zero-shot open-vocabulary setup. We notice that the localization error
propagation problem identified in Chapter 4 also extends to the zero-shot
setting, where the error is significant due to the lack of any labelled
samples. This lack of labelled samples can be compensated for by using
large-scale pretrained vision-language models such as CLIP. The base design of
our network is a single-stage TAD model. However, CLIP was never designed for
dense detection tasks in videos, so we propose a representation-masking
concept that masks out the foreground and aligns it with the CLIP embeddings.
The class-agnostic representation masks additionally make the model generalize
better to unseen classes. As a result, our proposed design achieves
state-of-the-art results in both open-set and closed-set TAD settings.
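The alignment idea can be sketched roughly as follows (an illustrative approximation under stated assumptions, not the exact proposed architecture): a class-agnostic foreground mask weights the snippet features, and the pooled masked representation is scored against precomputed CLIP text embeddings by cosine similarity, which extends naturally to unseen class prompts.

```python
# Rough sketch: class-agnostic foreground masking followed by cosine alignment
# with precomputed CLIP text embeddings. Feature dimensions are assumptions.
import torch
import torch.nn.functional as F

def masked_clip_scores(snippet_feats, fg_mask_logits, text_embeds):
    """
    snippet_feats: (T, D) per-snippet video features projected into CLIP's space
    fg_mask_logits: (T,) class-agnostic foreground logits
    text_embeds: (C, D) CLIP text embeddings for C (possibly unseen) class prompts
    """
    mask = torch.sigmoid(fg_mask_logits)                                  # soft foreground mask
    fg_feat = (mask.unsqueeze(-1) * snippet_feats).sum(0) / mask.sum().clamp(min=1e-6)
    fg_feat = F.normalize(fg_feat, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return fg_feat @ text_embeds.t()                                      # (C,) similarity scores

scores = masked_clip_scores(torch.randn(100, 512), torch.randn(100), torch.randn(20, 512))
```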