Abstract
Representation learning aims to discover individual salient features of a
domain in a compact and descriptive form that strongly identifies the unique
characteristics of a given sample relative to its domain. Existing work on
visual style representation has tried to explicitly disentangle style from
content during training, but a complete separation of the two has yet to be
achieved. Our paper aims to learn a representation of visual
artistic style more strongly disentangled from the semantic content depicted in
an image. We use Neural Style Transfer (NST) to measure and drive the learning
signal and achieve state-of-the-art representation learning on explicitly
disentangled metrics. We show that strongly addressing the disentanglement of
style and content leads to large gains in style-specific metrics, yields
representations that encode far less semantic information, and achieves
state-of-the-art accuracy in downstream multimodal applications.