Abstract
Cross-modal person re-identification (Re-ID) is a crucial component of modern video surveillance
systems and security infrastructure. The task of matching people across multiple non-overlapping
camera views encompasses numerous computer vision challenges, such as changes
in illumination, occlusions, pose variations, and even the absence of a visual query. In this thesis,
we develop person Re-ID methods based on images of people and their textual descriptions. The
key challenge is to align cross-modality feature representations according to fine-grained
appearance attributes while ignoring background noise.
The first contribution jointly models a multi-modal latent space in which corresponding
visual and textual representations are pushed closer together. However, the performance of such
late-fusion models depends on the quality of the feature extraction backbone for each modality.
To overcome this issue, the second contribution introduces a unified cross-modal feature learning
backbone that implicitly aligns shared semantic concepts from the earliest layers of the network.
Unified feature learning effectively uses textual data as a super-annotation signal for visual
representation learning and automatically rejects irrelevant information.
With the emergence of vision transformers (ViTs), the practice of splitting a 2-D image into a 1-D
sequence of tokens and learning long-range interactions solely via a self-attention mechanism
has further strengthened the case for a unified backbone model. In the final contribution, we propose
a vision transformer architecture designed for effective intra-modal and cross-modal
communication based on special tokens. These tokens serve two purposes. First,
they condense the image information into a small set of tokens. Second, they
are responsible for interactions across spatial windows of an image as well as across
modalities.
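
As a rough illustration of the tokenization step mentioned above, the following sketch splits a 2-D image into a 1-D, row-major sequence of flattened patches. It is a minimal example under simplifying assumptions, not the architecture proposed in this thesis: the image is single-channel, the function name patchify and the toy 4 x 4 input are hypothetical, and the linear projection, position embeddings, and special tokens of a real ViT are omitted.

```python
def patchify(image, patch):
    """Split an H x W image (a list of rows) into a row-major
    sequence of flattened patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    # Walk over the top-left corner of every non-overlapping patch.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch row by row into a single 1-D token.
            token = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            tokens.append(token)
    return tokens


# Hypothetical toy input: a 4 x 4 image holding the values 0..15.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)  # four tokens of four values each
```

In a full model, each token would then be linearly projected and combined with a position embedding before self-attention is applied over the sequence.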
The proposed approach of multi-modal unified feature learning has the potential to address
the limitations of traditional single-modality person Re-ID methods and has important practical
implications for real-world video surveillance systems.