Abstract
The modelling of human geometry and appearance from videos has been a focus of computer
vision researchers for decades, driven by an exciting range of possible applications in video
games, healthcare and the TV & film industry. Research in this field has advanced on two
fronts: bottom-up methods that reconstruct human geometry from raw image data, and top-down
approaches that explain the image data with existing models of human shape. There has been an
increased focus on model-based approaches over the last five years, largely due to the emergence
of new statistical models of human shape and pose (notably the SMPL model) and also due
to advances in human joint estimation from images. While model-free human reconstruction
methods have been mostly limited to constrained environments, model-based methods can
exploit priors in the statistical body model to capture the human geometry in more challenging
scenarios, including: monocular video, partially occluded images, and multiple people. In this
thesis we demonstrate the advantage of parametric human models in representing the shape and
texture of humans from video input, by applying them to a range of tasks.
The first contribution demonstrates the effectiveness of a parametric human body model as a
tool for generating free-viewpoint video renderings of humans in motion. In this chapter we
introduce an optimisation framework for aligning the SMPL body model with multi-view video
of a human captured in a studio. The model-based approach consistently provides a full-body
reconstruction with fine details around the face and hands. Further, the model-based pipeline
allows for considerable compression of the reconstructed sequence: the geometry is encoded
simply as a set of model parameters, and the model structure provides a temporally consistent
texture map layout, allowing for efficient video compression of the human appearance. These
benefits allow for efficient and computationally inexpensive playback in a game engine, and in
virtual reality.
The second contribution is the model-based reconstruction of multiple people in sports. Sports
datasets are especially challenging due to multiple interacting players, heavy occlusion, low
effective player resolution and poor calibration. To extend our reconstruction pipeline to multiple
people, we introduce an algorithm for the association of 2D pose estimations of multiple people
between camera views. We also introduce a novel method for the correction of errors that are
often associated with 2D pose estimates. Finally, we introduce an algorithm for the tracking of
skeletons over time, which is robust to missing detections. We use the associated and temporally
tracked pose detections in our model-based pipeline to reconstruct multiple players in sports
sequences, despite the heavy occlusion and low detail in the original footage.
The third contribution is a method for the capture and modelling of dynamic human texture
appearance from a minimal set of input cameras. Previous methods for capturing the dynamic
appearance of a human from multi-view video rely on large camera setups and typically store
texture on a per-frame basis. We build a parametric reconstruction from a minimal camera set
(as few as three) to obtain partial texture observations. The parametric reconstruction provides a
temporally consistent texture map layout, as well as the human pose in each frame. The partial
texture observations are combined in a learned framework to generate full-body textures with
dynamic details given an input pose. Inspired by traditional multi-view texturing algorithms,
we adopt a multi-band weighted loss function to train our network, which minimises texture
artefacts.
Our final contribution is a novel continuous displacement field representation for the
reconstruction of clothed human body shape from a single image. Recent model-free monocular human
shape estimation methods struggle with highly varied poses and occlusions, whereas parametric
methods are more robust but limited to tight clothing. Our learnt continuous displacement field
representation reconstructs detailed shape for humans in challenging poses. We combine local
image features with canonical parametric body model coordinates to build a displacement field
that models the distance between the underlying parametric model and the true human surface.
Our ParamCDF representation can also infer detailed human shape from partially occluded
images.