Output list
Conference proceeding
Do Generalised Classifiers really work on Human Drawn Sketches?
Published 31/10/2024
Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LXII, 15114
European Conference on Computer Vision (ECCV), 29/09/2024–04/10/2024, Milan, Italy
This paper, for the first time, marries large foundation models with human sketch understanding. We demonstrate what this brings
– a paradigm shift in terms of generalised sketch representation learning (e.g., classification). This generalisation happens on two fronts: (i)
generalisation across unknown categories (i.e., open-set), and (ii) generalisation traversing abstraction levels (i.e., good and bad sketches),
both being timely challenges that remain unsolved in the sketch literature. Our design is intuitive and centred around transferring the already
stellar generalisation ability of CLIP to benefit generalised learning for
sketches. We first “condition” the vanilla CLIP model by learning sketch-specific prompts using a novel auxiliary head of raster to vector sketch
conversion. This importantly makes CLIP “sketch-aware”. We then make
CLIP acute to the inherently different sketch abstraction levels. This
is achieved by learning a codebook of abstraction-specific prompt biases, a weighted combination of which facilitates the representation of
sketches across abstraction levels – low abstract edge-maps, medium abstract sketches in TU-Berlin, and highly abstract doodles in QuickDraw.
Our framework surpasses popular sketch representation learning algorithms in both zero-shot and few-shot setups and in novel settings across
different abstraction boundaries.
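The codebook idea in this abstract (a weighted combination of abstraction-specific prompt biases) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the function names, the use of a softmax to obtain the weights, and the additive way the bias is applied to the prompt are all hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def abstraction_aware_prompt(base_prompt, codebook, scores):
    """Add a weighted mix of codebook biases to a base prompt vector.

    base_prompt: list of floats (a hypothetical learned prompt embedding)
    codebook:    list of bias vectors, one per abstraction level (assumed)
    scores:      one raw score per codebook entry (assumed to be predicted
                 from the input sketch)
    """
    weights = softmax(scores)
    dim = len(base_prompt)
    bias = [
        sum(w * entry[d] for w, entry in zip(weights, codebook))
        for d in range(dim)
    ]
    return [p + b for p, b in zip(base_prompt, bias)]


# Toy example: three abstraction levels, three-dimensional prompts.
base_prompt = [0.0, 0.0, 0.0]
codebook = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
prompt = abstraction_aware_prompt(base_prompt, codebook, scores=[2.0, 0.5, 0.1])
```

With a zero base prompt and a one-hot codebook, the resulting prompt simply equals the mixing weights, which makes the behaviour easy to inspect.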
Conference proceeding
Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes
First online publication 16/09/2024
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 17/06/2024–21/06/2024, Seattle, Washington, USA
In this paper, we democratise 3D content creation, enabling precise generation of 3D shapes from abstract sketches while overcoming limitations
tied to drawing skills. We introduce a novel part-level modelling and alignment framework that facilitates abstraction modelling and cross-modal correspondence. Leveraging the same part-level decoder, our approach seamlessly extends to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions, eliminating the need for a dataset pairing human sketches and 3D shapes. Additionally, our method introduces a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space, our approach significantly reduces computational demands and processing time.
Conference proceeding
What Sketch Explainability Really Means for Downstream Tasks
First online publication 16/09/2024
IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR), 17/06/2024–21/06/2024, Seattle, Washington, USA
In this paper, we explore the unique modality of sketch for explainability,
emphasising the profound impact of human strokes compared to conventional
pixel-oriented studies. Beyond explanations of network behavior, we discern the
genuine implications of explainability across diverse downstream sketch-related
tasks. We propose a lightweight and portable explainability solution -- a
seamless plugin that integrates effortlessly with any pre-trained model,
eliminating the need for re-training. Demonstrating its adaptability, we
present four applications: highly studied retrieval and generation, and
completely novel assisted drawing and sketch adversarial attacks. The
centrepiece to our solution is a stroke-level attribution map that takes
different forms when linked with downstream tasks. By addressing the inherent
non-differentiability of rasterisation, we enable explanations at both coarse
stroke level (SLA) and partial stroke level (P-SLA), each with its advantages
for specific downstream tasks.
Conference proceeding
SketchINR: A First Look into Sketches as Implicit Neural Representations
First online publication 16/09/2024
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 17/06/2024–21/06/2024, Seattle, Washington, USA
We propose SketchINR, to advance the representation of vector sketches with
implicit neural models. A variable length vector sketch is compressed into a
latent space of fixed dimension that implicitly encodes the underlying shape as
a function of time and strokes. The learned function predicts the $xy$ point
coordinates in a sketch at each time and stroke. Despite its simplicity,
SketchINR outperforms existing representations at multiple tasks: (i) Encoding
an entire sketch dataset into a fixed size latent vector, SketchINR gives
$60\times$ and $10\times$ data compression over raster and vector sketches,
respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity
representation than other learned vector sketch representations, and is
uniquely able to scale to complex vector sketches such as FS-COCO. (iii)
SketchINR supports parallelisation that can decode/render $\sim 100\times$
faster than other learned vector representations such as SketchRNN. (iv)
SketchINR, for the first time, emulates the human ability to reproduce a sketch
with varying abstraction in terms of number and complexity of strokes. As a
first look at implicit sketches, SketchINR's compact high-fidelity
representation will support future work in modelling long and complex sketches.
Doctoral Thesis
Exploring Sketch Traits for Democratising Sketch Based Image Retrieval
Degree award date 31/01/2024
Free-hand sketching, a mode of communication transcending age, nationality, and language barriers, has deep historical roots ingrained in human civilisation, and has become even easier with the advent of touchscreen devices. Human expressivity conveyed via sketches, along with its ability to depict fine-grained details, has encouraged numerous applications in vision tasks such as image retrieval, image generation or editing, segmentation, etc. Thanks to sketch's ability to model fine-grained details of a human query, and its large commercial potential, sketch-based image retrieval (SBIR) has flourished as one of the most researched topics in sketch, which we address in this thesis.
Despite the growing acceptance of SBIR, challenges like diverse amateur sketch-styles and limited sketch data persist. Addressing these issues, this thesis presents five contributions across two themes -- addressing traits of sketches, enhancing retrieval accuracy and overcoming obstacles for application in real-world scenarios.
In our first theme, the first chapter addresses the ignored inherent hierarchical structure of sketches, proposing a cross-modal co-attention network that considers sketch-photo pairs at different abstraction levels. The second chapter tackles diversity in amateur styles, introducing a meta-learning-based variational auto-encoder network to disentangle style from semantics.
Transitioning to the second theme, the third chapter tackles data scarcity in fine-grained SBIR, where a knowledge-distillation paradigm utilises unlabelled photos to enrich the cross-modal embedding space, and a novel training paradigm improves stability and performance. Focusing on real-world scenarios, the fourth chapter delves into zero-shot SBIR, employing a meta-learning-based test-time training paradigm to adapt models during inference and reduce train-test distribution gaps.
The final chapter applies the CLIP foundation model for zero-shot SBIR. A prompt-learning setup is proposed to adapt CLIP, showcasing its potential in addressing data scarcity across diverse sketch-related tasks. The overarching message is the promise of foundation models in overcoming challenges related to data scarcity in various sketch applications.
Journal article
Bi-directional Feature Reconstruction Network for Fine-Grained Few-Shot Image Classification
Published 26/06/2023
Proceedings of the ... AAAI Conference on Artificial Intelligence, 37, 3, 2821 - 2829
The main challenge for fine-grained few-shot image classification is to learn feature representations with higher inter-class and lower intra-class variations, with a mere few labelled samples. Conventional few-shot learning methods, however, cannot be naively adopted for this fine-grained setting -- a quick pilot study reveals that they in fact push for the opposite (i.e., lower inter-class variations and higher intra-class variations). To alleviate this problem, prior works predominantly use a support set to reconstruct the query image and then utilize metric learning to determine its category. Upon careful inspection, we further reveal that such unidirectional reconstruction methods only help to increase inter-class variations and are not effective in tackling intra-class variations. In this paper, we for the first time introduce a bi-reconstruction mechanism that can simultaneously accommodate inter-class and intra-class variations. In addition to using the support set to reconstruct the query set for increasing inter-class variations, we further use the query set to reconstruct the support set for reducing intra-class variations. This design effectively helps the model to explore more subtle and discriminative features, which is key for the fine-grained problem at hand. Furthermore, we also construct a self-reconstruction module to work alongside the bi-directional module to make the features even more discriminative. Experimental results on three widely used fine-grained image classification datasets consistently show considerable improvements compared with other methods. Codes are available at: https://github.com/PRIS-CV/Bi-FRN.
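The bi-directional reconstruction described in this abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the released Bi-FRN code: the reconstruction operator is assumed to be a softmax attention over local feature vectors, the two reconstruction errors are simply summed, and all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct(targets, sources, temperature=1.0):
    """Rebuild each target vector as an attention-weighted sum of sources."""
    rebuilt = []
    for t in targets:
        logits = [dot(t, s) / temperature for s in sources]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        rebuilt.append([
            sum(w / total * s[d] for w, s in zip(weights, sources))
            for d in range(len(t))
        ])
    return rebuilt


def mean_sq_error(originals, rebuilt):
    """Average squared reconstruction error per feature vector."""
    err = sum(
        (x - y) ** 2
        for a, b in zip(originals, rebuilt)
        for x, y in zip(a, b)
    )
    return err / len(originals)


def bidirectional_distance(query, support):
    """Support-to-query error plus query-to-support error."""
    q_from_s = reconstruct(query, support)   # support rebuilds query
    s_from_q = reconstruct(support, query)   # query rebuilds support
    return mean_sq_error(query, q_from_s) + mean_sq_error(support, s_from_q)


def classify(query, supports_by_class):
    """Pick the class whose support set gives the smallest distance."""
    return min(
        supports_by_class,
        key=lambda c: bidirectional_distance(query, supports_by_class[c]),
    )


# Toy local features (two vectors per image, two dimensions each).
query = [[1.0, 0.0], [0.9, 0.1]]
supports = {
    "a": [[1.0, 0.0], [0.8, 0.2]],
    "b": [[0.0, 1.0], [0.1, 0.9]],
}
predicted = classify(query, supports)
```

In this toy setting the query is assigned to class "a", since both reconstruction directions yield small errors for that class and large errors for class "b".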
Conference proceeding
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
Published 06/2023
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6873 - 6883
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots the prior state of the art by ≈11%. This is not via complicated design though, but by addressing two critical issues facing the community: (i) the gold standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high accuracy model. For the former, we propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances. For the latter, we put forward a novel knowledge distillation module that can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to pull sketches of the same instance closer together and away from others, and one more amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of ≈4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos over the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space to that in the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of ≈4-5%). Apart from outperforming prior art significantly, our model also yields satisfactory results on generalising to new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/
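The combination of cross-modal and intra-modal triplet losses described in this abstract might look roughly like the sketch below. The squared Euclidean distance, the margin value, the equal weighting of the three terms, and all names are assumptions for illustration; the abstract does not specify them.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge-style triplet loss on squared distances."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def combined_loss(sketch, sketch_same, sketch_other,
                  photo, photo_aug, photo_other, margin=0.2):
    """Cross-modal triplet plus two intra-modal triplets (equal weights assumed)."""
    # Cross-modal: a sketch should be nearer its own photo than another photo.
    cross = triplet_loss(sketch, photo, photo_other, margin)
    # Intra-modal (sketch): sketches of the same instance stay together.
    intra_sketch = triplet_loss(sketch, sketch_same, sketch_other, margin)
    # Intra-modal (photo): an augmented view stays near, other photos move away.
    intra_photo = triplet_loss(photo, photo_aug, photo_other, margin)
    return cross + intra_sketch + intra_photo


# Toy two-dimensional embeddings that are already well separated.
total = combined_loss(
    sketch=[0.0, 0.0], sketch_same=[0.1, 0.0], sketch_other=[1.0, 1.0],
    photo=[0.0, 0.1], photo_aug=[0.05, 0.1], photo_other=[1.0, 0.9],
)
```

For these well-separated toy embeddings every margin is already satisfied, so all three terms are zero; swapping a positive with a negative makes the corresponding term positive.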
Conference proceeding
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
Published 06/2023
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2765 - 2775
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances on foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior art by a large margin (24.8%) - a great testimony on studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state of the art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/
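The patch shuffling step mentioned in this abstract can be pictured with a small sketch. It assumes that an image is cut into a regular grid and that one and the same permutation is applied to a sketch and its paired photo; the grid size, the helper names, and how the shuffled pair enters the loss are hypothetical and not taken from the paper.

```python
def split_patches(image, grid):
    """Cut a 2D list into grid x grid patches, in row-major order."""
    patch_h = len(image) // grid
    patch_w = len(image[0]) // grid
    return [
        [row[c * patch_w:(c + 1) * patch_w]
         for row in image[r * patch_h:(r + 1) * patch_h]]
        for r in range(grid)
        for c in range(grid)
    ]


def merge_patches(patches, grid):
    """Reassemble row-major patches into a single 2D list."""
    rows = []
    for r in range(grid):
        band = patches[r * grid:(r + 1) * grid]
        for i in range(len(band[0])):
            rows.append([value for patch in band for value in patch[i]])
    return rows


def shuffle_patches(image, grid, permutation):
    """Reorder the patches of an image according to a permutation."""
    patches = split_patches(image, grid)
    return merge_patches([patches[i] for i in permutation], grid)


# Toy 4x4 "images"; the same permutation is applied to both modalities.
sketch = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
photo = [[v + 100 for v in row] for row in sketch]
permutation = [3, 2, 1, 0]
shuffled_sketch = shuffle_patches(sketch, 2, permutation)
shuffled_photo = shuffle_patches(photo, 2, permutation)
```

Because both modalities share the permutation, a patch in the shuffled sketch still sits at the same location as its counterpart in the shuffled photo, which is the kind of structural correspondence the abstract refers to.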
Conference proceeding
Picture that Sketch: Photorealistic Image Generation from Abstract Sketches
Published 06/2023
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6850 - 6861
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how well you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables; amongst them, we show how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing the state of the art. We put forward generated results in the supplementary for everyone to scrutinise. Project page: https://subhadeepkoley.github.io/PictureThatSketch
Conference proceeding
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Published 06/2023
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10972 - 10983
In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities - sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities - use any combination of modalities as query for downstream tasks like retrieval, (ii) optionality across tasks - simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, therefore serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multi-facet of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications. Project Page: https://pinakinathc.github.io/scenetrilogy