Abstract
We advance sketch research to scenes with the first dataset of freehand scene
sketches, FS-COCO. With practical applications in mind, we collect sketches
that convey scene content well but can be sketched within a few minutes by a
person with any sketching skills. Our dataset comprises 10,000 freehand scene
vector sketches with per-point space-time information, drawn by 100 non-expert
individuals, offering both object- and scene-level abstraction. Each sketch is
augmented with its text description. Using our dataset, we study for the first
time the problem of fine-grained image retrieval from freehand scene sketches
and sketch captions. We draw insights on: (i) Scene salience encoded in
sketches via the temporal order of strokes; (ii) Performance comparison of
image retrieval from a scene sketch and an image caption; (iii) Complementarity of
information in sketches and image captions, as well as the potential benefit of
combining the two modalities. In addition, we extend a popular LSTM-based
vector sketch encoder to handle sketches of greater complexity than supported
by previous work. Namely, we propose a hierarchical sketch decoder, which we
leverage in a sketch-specific "pretext" task. Our dataset enables for the
first time research on freehand scene sketch understanding and its practical
applications.