Abstract
In this paper, we extend scene understanding to include that of human sketch.
The result is a complete trilogy of scene representation from three diverse and
complementary modalities -- sketch, photo, and text. Instead of learning a
rigid three-way embedding and being done with it, we focus on learning a flexible
joint embedding that fully supports the ``optionality'' that this
complementarity brings. Our embedding supports optionality on two axes: (i)
optionality across modalities -- use any combination of modalities as query for
downstream tasks like retrieval, (ii) optionality across tasks -- the same
embedding simultaneously supports discriminative (e.g., retrieval) and
generative (e.g., captioning) tasks. This provides flexibility to
end-users by exploiting the best of each modality, thereby serving the very
purpose behind our proposal of a trilogy in the first place. First, a
combination of an information bottleneck and conditional invertible neural
networks disentangles the modality-specific components from the
modality-agnostic ones in sketch, photo, and text. Second, the
modality-agnostic instances from sketch,
photo, and text are synergised using a modified cross-attention. Once learned,
we show our embedding can accommodate a wide spectrum of scene-related tasks,
including those enabled for the first time by the inclusion of sketch, all
without any task-specific modifications. Project Page:
\url{http://www.pinakinathc.me/scenetrilogy}
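
To make the fusion step concrete, below is a minimal sketch of how
modality-agnostic embeddings from sketch, photo, and text could be combined
with a cross-attention layer. The module name, embedding dimension, head
count, and single-layer residual design are illustrative assumptions; this
does not reproduce the paper's modified cross-attention.

\begin{verbatim}
import torch
import torch.nn as nn

class ModalityAgnosticFusion(nn.Module):
    """Toy cross-attention fusion of modality-agnostic embeddings.

    A query modality attends over the remaining modalities; the attended
    context is added back to the query via a residual connection.
    """
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feat, context_feats):
        # query_feat:    (B, 1, dim) embedding of the querying modality
        # context_feats: (B, N, dim) embeddings of the other modalities
        fused, _ = self.attn(query_feat, context_feats, context_feats)
        return self.norm(query_feat + fused)

# Example: a sketch embedding attends over photo and text embeddings
# (random tensors stand in for encoder outputs).
sketch = torch.randn(4, 1, 512)
photo_text = torch.randn(4, 2, 512)
joint = ModalityAgnosticFusion()(sketch, photo_text)  # (4, 1, 512)
\end{verbatim}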