Abstract
Existing person image generative models can perform either image generation or
pose transfer, but not both. We propose a unified diffusion model, UPGPT, that
provides a universal solution for all person image tasks: generation, pose
transfer, and editing. With multimodal conditioning and disentanglement
capabilities, our approach offers fine-grained control over image generation
and editing using any combination of pose, text, and image, all without a
semantic segmentation mask, which can be challenging to obtain or edit. We also
pioneer the use of the parameterized SMPL body model in pose-guided person
image generation, demonstrating a new capability: simultaneous pose and
camera-view interpolation while maintaining a person's appearance. Results on
the benchmark DeepFashion dataset show that UPGPT sets a new state of the art
while simultaneously pioneering new editing and pose-transfer capabilities in
human image generation.