Output list
Preprint
What Can Human Sketches Do for Object Detection?
Posted to a preprint site 27/03/2023
Computer Vision and Pattern Recognition (CVPR), 2023
Sketches are highly expressive, inherently capturing subjective and
fine-grained visual cues. The exploration of such innate properties of human
sketches has, however, been limited to that of image retrieval. In this paper,
for the first time, we cultivate the expressiveness of sketches but for the
fundamental vision task of object detection. The end result is a sketch-enabled
object detection framework that detects based on what you sketch --
that "zebra" (e.g., one that is eating the grass) in a herd of
zebras (instance-aware detection), and only the part (e.g., "head" of
a "zebra") that you desire (part-aware detection). We further dictate that our
model works (i) without knowing which category to expect at testing (zero-shot)
and (ii) without requiring additional bounding boxes (as per fully supervised) and
class labels (as per weakly supervised). Instead of devising a model from the
ground up, we show an intuitive synergy between foundation models (e.g., CLIP)
and existing sketch models built for sketch-based image retrieval (SBIR), which
can already elegantly solve the task -- CLIP to provide model generalisation,
and SBIR to bridge the (sketch→photo) gap. In particular, we first
perform independent prompting on both sketch and photo branches of an SBIR
model to build highly generalisable sketch and photo encoders on the back of
the generalisation ability of CLIP. We then devise a training paradigm to adapt
the learned encoders for object detection, such that the region embeddings of
detected boxes are aligned with the sketch and photo embeddings from SBIR.
Evaluated on standard object detection datasets like PASCAL-VOC
and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object
detectors (WSOD) on zero-shot setups. Project Page:
https://pinakinathc.github.io/sketch-detect
Preprint
Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR
Posted to a preprint site 23/03/2023
2023 Conference on Computer Vision and Pattern Recognition, 18/06/2023–22/06/2023, Vancouver, Canada
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR)
literature by putting forward a strong baseline that outperforms the prior
state of the art by ~11%. This is achieved not via complicated design, but by
addressing two critical issues facing the community: (i) the gold standard
triplet loss does not enforce holistic latent space geometry, and (ii) there
are never enough sketches to train a high accuracy model. For the former, we
propose a simple modification to the standard triplet loss that explicitly
enforces separation amongst photo/sketch instances. For the latter, we put
forward a novel knowledge distillation module that can leverage photo data for model
training. Both modules are then plugged into a novel plug-and-play training
paradigm that allows for more stable training. More specifically, for (i) we
employ an intra-modal triplet loss amongst sketches to bring sketches of the
same instance closer together and away from others, and one more amongst photos to push away
different photo instances while bringing closer a structurally augmented
version of the same photo (offering a gain of ~4-6%). To tackle (ii), we first
pre-train a teacher on the large set of unlabelled photos over the
aforementioned intra-modal photo triplet loss. Then we distill the contextual
similarity present amongst the instances in the teacher's embedding space to
that in the student's embedding space, by matching the distribution over
inter-feature distances of respective samples in both embedding spaces
(delivering a further gain of ~4-5%). Apart from outperforming prior art
significantly, our model also yields satisfactory results on generalising to
new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/
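
The intra-modal photo triplet loss described in the abstract above can be illustrated with a minimal sketch. It assumes a Euclidean distance and a hinge-style margin; the function names, the margin value, and the toy 2-D embeddings are hypothetical and are not taken from the paper.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss: pull the positive closer than the negative by `margin`.

    The margin value is an illustrative assumption.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy photo-branch example (hypothetical embeddings):
# the positive is an augmented version of the same photo,
# the negative is a different photo instance.
photo = [1.0, 0.0]
photo_augmented = [0.9, 0.1]
other_photo = [0.0, 1.0]

# Well-separated triplet: the loss vanishes.
loss_separated = triplet_loss(photo, photo_augmented, other_photo)

# Swapping positive and negative violates the margin, so the loss is positive.
loss_violated = triplet_loss(photo, other_photo, photo_augmented)
```

The same function would apply unchanged to the sketch branch, with sketches of the same instance as anchor and positive.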
Preprint
CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not
Posted to a preprint site 23/03/2023
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval
(ZS-SBIR). We are largely inspired by recent advances on foundation models and
the unparalleled generalisation ability they seem to offer, but for the first
time tailor them to benefit the sketch community. We put forward novel designs on
how best to achieve this synergy, for both the category setting and the
fine-grained setting ("all"). At the very core of our solution is a prompt
learning setup. First, we show that just by factoring in sketch-specific prompts, we
already have a category-level ZS-SBIR system that outperforms all prior art by
a large margin (24.8%) - a great testimony to the value of studying the CLIP and ZS-SBIR
synergy. Moving onto the fine-grained setup is however trickier, and requires a
deeper dive into this synergy. For that, we come up with two specific designs
to tackle the fine-grained matching nature of the problem: (i) an additional
regularisation loss to ensure the relative separation between sketches and
photos is uniform across categories, which is not the case for the gold
standard standalone triplet loss, and (ii) a clever patch shuffling technique
to help establish instance-level structural correspondences between
sketch-photo pairs. With these designs, we again observe significant
performance gains in the region of 26.9% over previous state-of-the-art. The
take-home message, if any, is that the proposed CLIP and prompt learning paradigm
carries great promise in tackling other sketch-related tasks (not limited to
ZS-SBIR) where data scarcity remains a great challenge. Project page:
https://aneeshan95.github.io/Sketch_LVM/
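
The abstract above does not spell out how patch shuffling is performed. One plausible reading, sketched here purely as an assumption, is that a sketch and its paired photo are cut into the same grid of patches and rearranged with the same permutation, so that patch positions still correspond after shuffling. All function names and the toy 4x4 "images" are hypothetical.

```python
import random

def split_into_patches(image, n):
    """Split a square 2D list into an n x n grid of patches (row-major order).

    Assumes the image side length is divisible by n.
    """
    step = len(image) // n
    patches = []
    for pr in range(n):
        for pc in range(n):
            rows = image[pr * step:(pr + 1) * step]
            patches.append([row[pc * step:(pc + 1) * step] for row in rows])
    return patches

def assemble(patches, n):
    """Stitch an n x n row-major list of patches back into one 2D list."""
    step = len(patches[0])
    image = []
    for pr in range(n):
        for r in range(step):
            row = []
            for pc in range(n):
                row.extend(patches[pr * n + pc][r])
            image.append(row)
    return image

def shuffle_patches(image, n, permutation):
    """Rearrange the patches of `image` according to `permutation`."""
    patches = split_into_patches(image, n)
    return assemble([patches[i] for i in permutation], n)

# One shared permutation for the sketch-photo pair (seed chosen arbitrarily).
rng = random.Random(0)
permutation = list(range(4))
rng.shuffle(permutation)

# Toy 4x4 inputs: each photo pixel is its sketch pixel plus 100,
# which makes the preserved correspondence easy to check.
sketch = [[r * 4 + c for c in range(4)] for r in range(4)]
photo = [[100 + r * 4 + c for c in range(4)] for r in range(4)]

shuffled_sketch = shuffle_patches(sketch, 2, permutation)
shuffled_photo = shuffle_patches(photo, 2, permutation)
```

Because both inputs share one permutation, every location in the shuffled sketch still lines up with the matching location in the shuffled photo.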
Preprint
Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings
Posted to a preprint site 20/03/2023
Human sketch has already proved its worth in various visual understanding
tasks (e.g., retrieval, segmentation, image-captioning, etc.). In this paper, we
reveal a new trait of sketches - that they are also salient. This is intuitive
as sketching is a natural attentive process at its core. More specifically, we
aim to study how sketches can be used as a weak label to detect salient objects
present in an image. To this end, we propose a novel method that emphasises
how a "salient object" could be explained by hand-drawn sketches. To accomplish
this, we introduce a photo-to-sketch generation model that aims to generate
sequential sketch coordinates corresponding to a given visual photo through a
2D attention mechanism. Attention maps accumulated across the time steps give
rise to salient regions in the process. Extensive quantitative and qualitative
experiments prove our hypothesis and delineate how our sketch-based saliency
detection model gives a competitive performance compared to the
state-of-the-art.
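
The idea of accumulating attention maps across time steps, as described in the abstract above, can be sketched as follows. Summing the maps and applying min-max normalisation is an assumption made for illustration; the function name and the toy 2x2 maps are hypothetical.

```python
def accumulate_attention(attention_maps):
    """Sum per-step 2D attention maps and rescale the result to [0, 1].

    `attention_maps` is a list of equally sized 2D lists, one per time step.
    Min-max rescaling is an illustrative choice, not the paper's stated method.
    """
    height = len(attention_maps[0])
    width = len(attention_maps[0][0])
    total = [[0.0] * width for _ in range(height)]
    for attention in attention_maps:
        for r in range(height):
            for c in range(width):
                total[r][c] += attention[r][c]
    lowest = min(min(row) for row in total)
    highest = max(max(row) for row in total)
    scale = (highest - lowest) or 1.0  # avoid dividing by zero on flat maps
    return [[(value - lowest) / scale for value in row] for row in total]

# Two decoding steps that both attend mostly to the top-left cell.
step_maps = [
    [[0.7, 0.1], [0.1, 0.1]],
    [[0.6, 0.2], [0.1, 0.1]],
]
saliency = accumulate_attention(step_maps)
```

Cells that receive attention at many time steps end up with the highest values in the accumulated map.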
Preprint
Picture that Sketch: Photorealistic Image Generation from Abstract Sketches
Posted to a preprint site 20/03/2023
IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, 18/06/2023–22/06/2023, Vancouver, BC, Canada
Given an abstract, deformed, ordinary sketch from untrained amateurs like you
and me, this paper turns it into a photorealistic image - just like those shown
in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in
that we do not dictate an edgemap-like sketch to start with, but aim to work
with abstract free-hand human sketches. In doing so, we essentially democratise
the sketch-to-photo pipeline, "picturing" a sketch regardless of how well you
sketch. Our contribution at the outset is a decoupled encoder-decoder training
paradigm, where the decoder is a StyleGAN trained on photos only. This
importantly ensures that generated results are always photorealistic. The rest
is then all centred around how best to deal with the abstraction gap between
sketch and photo. For that, we propose an autoregressive sketch mapper trained
on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We
further introduce specific designs to tackle the abstract nature of human
sketches, including a fine-grained discriminative loss on the back of a trained
sketch-photo retrieval model, and a partial-aware sketch augmentation strategy.
Finally, we showcase a few downstream tasks our generation model enables,
amongst them showing how fine-grained sketch-based image retrieval, a
well-studied problem in the sketch community, can be reduced to an image
(generated) to image retrieval task, surpassing the state of the art. We put
forward generated results in the supplementary for everyone to scrutinise.
Preprint
Bi-directional Feature Reconstruction Network for Fine-Grained Few-Shot Image Classification
Posted to a preprint site 30/11/2022
The main challenge for fine-grained few-shot image classification is to learn
feature representations with higher inter-class and lower intra-class
variations, with a mere few labelled samples. Conventional few-shot learning
methods however cannot be naively adopted for this fine-grained setting -- a
quick pilot study reveals that they in fact push for the opposite (i.e., lower
inter-class variations and higher intra-class variations). To alleviate this
problem, prior works predominantly use a support set to reconstruct the query
image and then utilize metric learning to determine its category. Upon careful
inspection, we further reveal that such unidirectional reconstruction methods
only help to increase inter-class variations and are not effective in tackling
intra-class variations. In this paper, we for the first time introduce a
bi-reconstruction mechanism that can simultaneously accommodate inter-class
and intra-class variations. In addition to using the support set to reconstruct
the query set for increasing inter-class variations, we further use the query
set to reconstruct the support set for reducing intra-class variations. This
design effectively helps the model to explore more subtle and discriminative
features, which is key for the fine-grained problem at hand. Furthermore, we
also construct a self-reconstruction module to work alongside the
bi-directional module to make the features even more discriminative.
Experimental results on three widely used fine-grained image classification
datasets consistently show considerable improvements compared with other
methods. Code is available at: https://github.com/PRIS-CV/Bi-FRN.
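
The two reconstruction directions described in the abstract above (support set reconstructing the query, and query reconstructing the support set) can be illustrated with a simplified sketch. It uses a softmax-attention weighted sum as a stand-in reconstruction step, which is an assumption and may differ from the reconstruction used in the linked repository. All names and the toy 2-D features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def reconstruction_error(targets, sources):
    """Mean squared error of rebuilding each target from the source vectors.

    Each target is rebuilt as an attention-weighted mix of the sources
    (a stand-in for the reconstruction step, chosen for illustration).
    """
    error = 0.0
    for target in targets:
        weights = softmax([dot(target, source) for source in sources])
        rebuilt = [sum(w * s[d] for w, s in zip(weights, sources))
                   for d in range(len(target))]
        error += sum((a - b) ** 2 for a, b in zip(target, rebuilt))
    return error / len(targets)

def bidirectional_distance(query, support):
    """Support-to-query error plus query-to-support error."""
    return reconstruction_error(query, support) + reconstruction_error(support, query)

def classify(query, supports_by_class):
    """Pick the class whose support features give the smallest distance."""
    return min(supports_by_class,
               key=lambda label: bidirectional_distance(query, supports_by_class[label]))

# Toy local features for one query image and two classes (hypothetical values).
query_features = [[4.0, 0.0], [3.0, 0.5]]
supports = {
    "class_a": [[4.0, 0.1], [3.5, 0.0]],
    "class_b": [[0.0, 4.0], [0.5, 3.0]],
}
predicted = classify(query_features, supports)
```

Adding the two errors is one simple way to combine the directions; a learned weighting between them would be an equally reasonable choice.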
Preprint
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Posted to a preprint site 25/04/2022
Computer Vision and Pattern Recognition (CVPR), 2023
In this paper, we extend scene understanding to include that of human sketch.
The result is a complete trilogy of scene representation from three diverse and
complementary modalities -- sketch, photo, and text. Instead of learning a
rigid three-way embedding and being done with it, we focus on learning a flexible
joint embedding that fully supports the "optionality" that this
complementarity brings. Our embedding supports optionality on two axes: (i)
optionality across modalities -- use any combination of modalities as query for
downstream tasks like retrieval, (ii) optionality across tasks --
simultaneously utilising the embedding for either discriminative (e.g.,
retrieval) or generative tasks (e.g., captioning). This provides flexibility to
end-users by exploiting the best of each modality, therefore serving the very
purpose behind our proposal of a trilogy in the first place. First, a
combination of information-bottleneck and conditional invertible neural
networks disentangles the modality-specific component from the modality-agnostic one in
sketch, photo, and text. Second, the modality-agnostic instances from sketch,
photo, and text are synergised using a modified cross-attention. Once learned,
we show our embedding can accommodate a multi-faceted set of scene-related tasks,
including those enabled for the first time by the inclusion of sketch, all
without any task-specific modifications. Project Page:
http://www.pinakinathc.me/scenetrilogy
Preprint
Mind the Gap: Enlarging the Domain Gap in Open Set Domain Adaptation
Posted to a preprint site 08/03/2020
Unsupervised domain adaptation aims to leverage labeled data from a source
domain to learn a classifier for an unlabeled target domain. Among its many
variants, open set domain adaptation (OSDA) is perhaps the most challenging, as
it further assumes the presence of unknown classes in the target domain. In
this paper, we study OSDA with a particular focus on enriching its ability to
traverse across larger domain gaps. Firstly, we show that existing
state-of-the-art methods suffer a considerable performance drop in the presence
of larger domain gaps, especially on a new dataset (PACS) that we re-purposed
for OSDA. We then propose a novel framework to specifically address the larger
domain gaps. The key insight lies with how we exploit the mutually beneficial
information between two networks: (a) to separate samples of known and unknown
classes, (b) to maximize the domain confusion between source and target domain
without the influence of unknown samples. It follows that (a) and (b) will
mutually supervise each other and alternate until convergence. Extensive
experiments are conducted on Office-31, Office-Home, and PACS datasets,
demonstrating the superiority of our method in comparison to other
state-of-the-art methods. Code available at
https://github.com/dongliangchang/Mutual-to-Separate/