Abstract
We conduct a detailed study of the ability of ViT and ResNet feature layers,
pretrained on pretext tasks, to quantify the similarity between pairs of 2D sketch
views of individual 3D shapes. We assess the performance in terms of the
models' abilities to retrieve similar views and ground-truth 3D shapes. Going
beyond a naive zero-shot performance study, we investigate alternative
fine-tuning strategies on one or several shape classes, and their
generalization to other shape classes. Leveraging progress in
Non-Photorealistic Rendering (NPR), we generate synthetic sketch views in several styles,
which we use to fine-tune pretrained foundation models using contrastive
learning. We study how the scale of an object in a sketch affects the
similarity of features at different network layers. We observe that, depending
on the scale, different feature layers can be more indicative of shape
similarity between sketch views. However, we find that both ViT and ResNet
perform best when object scales are similar. In summary, we show that
careful selection of the fine-tuning strategy yields consistent
improvements in zero-shot shape retrieval accuracy. We believe that our work
will have a significant impact on research in the sketch domain, providing
insights and guidance on how to adopt large pretrained models as perceptual
losses.