Cross-Modal Person Re-identification

Ammarah Farooq

doi:10.15126/thesis.900775

Doctoral Thesis

Open access

Doctor of Philosophy (PhD), University of Surrey

31/08/2023

DOI: https://doi.org/10.15126/thesis.900775

Abstract

Cross-modal person re-identification (Re-ID) is a crucial component of modern video surveillance systems and security infrastructure. The task of matching people across multiple non-overlapping camera views encompasses numerous computer vision challenges, such as changes in illumination, occlusions, pose variations, and even the absence of a visual query. In this thesis, we develop person Re-ID methods based on images of people and their textual descriptions. The key challenge is to align cross-modality feature representations according to fine-grained appearance attributes while ignoring background noise. The first contribution jointly models the multi-modal latent space, in which corresponding visual and textual representations are pushed closer together. However, the performance of such late-fusion models depends on the quality of the feature extraction backbone for each modality. To overcome this issue, the second contribution introduces a unified cross-modal feature learning backbone that implicitly aligns shared semantic concepts from the start of the learning network. Unified feature learning effectively uses textual data as a super-annotation signal for visual representation learning and automatically rejects irrelevant information. With the emergence of vision transformers (ViTs), the practice of splitting a 2-D image into a 1-D sequence of tokens and learning long-range interactions solely via a self-attention mechanism has further strengthened the case for a unified backbone model. In the final contribution, we propose a vision transformer architecture designed for effective intra-modal and cross-modal communication based on special tokens. The purpose of these tokens is twofold. First, they encapsulate the image information in a small set of tokens. Second, they are responsible for interacting across spatial windows of an image as well as across modalities.
The proposed approach of multi-modal unified feature learning has the potential to address the limitations of traditional single-modality person Re-ID methods and has important practical implications for real-world video surveillance systems.

Files and links (1)

PersonReID_using_Vision_Language_final_Ammarah Farooq (PDF, 8.07 MB)

License: CC BY-NC-SA 4.0, Open Access

Details

Title: Cross-Modal Person Re-identification
Creators: Ammarah Farooq - University of Surrey, School of Computer Science and Electronic Engineering
Contributors: Josef Vaclav Kittler (Supervisor) - University of Surrey, School of Computer Science and Electronic Engineering
Awarding Institution: University of Surrey; Doctor of Philosophy (PhD)
Theses and Dissertations: Doctor of Philosophy (PhD), University of Surrey
Identifiers: 99785965702346
Academic Unit: School of Computer Science and Electronic Engineering
Resource Type: Doctoral Thesis
