Abstract
Free-hand sketching, a mode of communication that transcends age, nationality, and language barriers, has deep historical roots in human civilisation and has become even easier with the advent of touchscreen devices. The expressivity of sketches, along with their ability to depict fine-grained details, has encouraged numerous applications in vision tasks such as image retrieval, image generation or editing, and segmentation. Owing to the ability of sketches to model the fine-grained details of a human query, and to its large commercial potential, sketch-based image retrieval (SBIR) has flourished as one of the most researched topics in sketch, and it is the topic we address in this thesis.
Despite the growing acceptance of SBIR, challenges such as diverse amateur sketch styles and limited sketch data persist. To address these issues, this thesis presents five contributions across two themes: addressing the traits of sketches to enhance retrieval accuracy, and overcoming obstacles to application in real-world scenarios.
In our first theme, the first chapter addresses the inherent, and so far ignored, hierarchical structure of sketches, proposing a cross-modal co-attention network that considers sketch-photo pairs at different abstraction levels. The second chapter tackles diversity in amateur styles, introducing a meta-learning-based variational auto-encoder network to disentangle style from semantics.
Transitioning to the second theme, the third chapter tackles data scarcity in fine-grained SBIR, where a knowledge-distillation paradigm utilises unlabelled photos to enrich the cross-modal embedding space, and a novel training paradigm improves stability and performance. Focusing on real-world scenarios, the fourth chapter delves into zero-shot SBIR, employing a meta-learning-based test-time training paradigm to adapt models during inference and reduce train-test distribution gaps.
The final chapter applies the CLIP foundation model to zero-shot SBIR. A prompt-learning setup is proposed to adapt CLIP, showcasing its potential for addressing data scarcity across diverse sketch-related tasks. The overarching message is that foundation models hold promise for overcoming data scarcity in a variety of sketch applications.