Abstract
Sketches have been used to conceptualise and depict visual objects since prehistoric times. Sketch research has flourished in the past decade, particularly with the proliferation of touchscreen devices. Much of the utility of sketch stems from its ability to delineate visual concepts universally, irrespective of age, race, language, or demography. The fine-grained interactive nature of sketches facilitates their application to various visual understanding tasks, such as image retrieval, image generation and editing, segmentation, and 3D shape modelling. However, sketches are highly abstract and subjective, varying with the perception of individuals. Although most agree that sketches give the user fine-grained control in depicting a visual object, many consider sketching a tedious process owing to their limited sketching skills, compared to other query/support modalities such as text or tags. Furthermore, collecting fine-grained sketch-photo associations remains a significant bottleneck to commercialising sketch applications. This thesis therefore aims to progress sketch-based visual understanding towards greater practicality.
Able to easily capture the fine-grained details of a visual concept, sketch understandably holds immense potential as a query medium, at times surpassing text, which can be insufficient to pin down fine-grained visual details. Of all sketch-related applications, therefore, fine-grained sketch-based image retrieval (FG-SBIR) has received the most attention, owing to its significant commercial potential in the retail industry. FG-SBIR aims to retrieve a particular photo instance, given a user's query sketch, from a gallery of photos of a particular category. Given the prevalence of touchscreen devices, the world is already primed for using sketch as a practical query modality for fine-grained retrieval. At an industrial scale, however, sketch has yet to gain traction as a query medium for retrieval owing to a few significant barriers. Breaking these barriers, this thesis addresses the practicality of FG-SBIR via two themes, putting forth five major contributions. The first theme comprises three contributions that focus on the practical deployment of FG-SBIR, one of the major forefronts of sketch research. The second theme, consisting of two further contributions, caters to the widespread applicability of sketches in real-world applications.
Within the first theme, our first chapter begins by identifying that the widespread applicability of FG-SBIR is hindered because drawing a sketch takes time, and most people struggle to draw a complete and faithful sketch. We thus reformulate the conventional FG-SBIR framework to tackle these challenges, with the ultimate goal of retrieving the target photo with the fewest strokes possible. We further propose an on-the-fly design that starts retrieving as soon as the user starts drawing. Accordingly, we devise a reinforcement learning-based cross-modal retrieval framework that optimises the ground-truth photo's rank over a complete sketch-drawing episode.
The second chapter identifies that the scarcity of sketch-photo pairs largely bottlenecks FG-SBIR performance. We therefore introduce a novel semi-supervised framework for instance-level cross-modal retrieval that leverages large-scale unlabelled photos to counter data scarcity. At the core of our semi-supervision design is a sequential photo-to-sketch generation model that generates paired sketches for unlabelled photos. We further introduce a discriminator-guided mechanism to guard against unfaithful generation, together with a distillation loss-based regulariser that provides tolerance against noisy training samples.
Thirdly, we observe that the fear-to-sketch problem (i.e., "I can't sketch") has proven fatal to the widespread adoption of fine-grained SBIR. A pilot study revealed that the problem lies largely in the presence of noisy strokes, rather than in an inherent inability to sketch. We thus design a stroke subset selector that detects noisy strokes, leaving only those that contribute positively to a successful retrieval. Our reinforcement learning-based formulation quantifies the importance of each stroke in a given subset according to the extent to which that stroke contributes to retrieval.
Moving on to our second theme, in the fourth chapter we focus on learning powerful representations via self-supervised learning from unlabelled data, thereby broadening the scope of sketches. Towards this, we advocate exploiting the dual-modality nature of sketches, as rasterised images and as vector coordinate sequences, which is pivotal in designing a self-supervised pre-text task for our goal. We address this dual representation by proposing two novel cross-modal translation pre-text tasks for self-supervised feature learning: Vectorisation and Rasterisation. Vectorisation learns to map image space to vector coordinates, and rasterisation the opposite. We show that our learned encoder modules benefit both raster-based and vector-based downstream tasks for analysing hand-drawn data.
The final chapter further contributes to the second theme by opening up a new avenue of sketch research with a novel sketch-based visual understanding task. Here we demonstrate the potential of sketch as a support modality for few-shot class-incremental learning (FSCIL), a setup that explicitly highlights the relevance of sketch as an information modality. Pushing FSCIL further, we address two key questions that bottleneck its ubiquitous application: (i) can the model learn from modalities other than just photos (as humans do), and (ii) what if photos are not readily accessible (owing to ethical and privacy constraints)? The product is a "Doodle It Yourself" (DIY) FSCIL framework in which users can freely sketch a few examples of a novel class for the model to learn to recognise photos of that class.