Abstract
The surge of interest in immersive applications in entertainment and industry has influenced research, prompting an increasing emphasis on digital avatar creation. A pivotal requirement in immersive applications is a high level of fidelity for these digital human avatars, mirroring real-world characteristics. The focus of this thesis is on reconstructing highly realistic human avatars from input data in which details of the human may not be distinctly visible due to factors such as low-resolution images of the subject, a large capture volume, or noise introduced by the capture system. In addition, this thesis focuses on improving the reproducibility of reconstruction techniques by using only a minimal number of consumer-grade sensors to reconstruct digital humans. This objective underscores the importance of making high-quality reconstruction techniques more accessible and feasible, irrespective of the equipment used.
The first contribution of the thesis addresses the problem of low-quality texture appearance when capturing in a large volume. Typically, cameras must be framed to cover the whole volume of a large space, so the person occupies only a small proportion of the field of view, which leads to low-quality rendering of the captured subject. The appearance quality of the large-volume capture is improved through super-resolution appearance transfer from a static high-resolution appearance capture rig, which uses high-resolution digital cameras to capture the person in a small volume.
The second contribution is an Attention-based Multi-Reference Super-resolution network that, given a low-resolution image, learns to adaptively transfer the most similar texture from multiple reference images to the super-resolution output whilst maintaining spatial coherence.
The concept of reference super-resolution is extended to multi-reference super-resolution, which provides a more diverse pool of image features to overcome the inherent information deficit of the low-resolution input while maintaining memory efficiency. With this approach, images showing all sides of a human model can be leveraged as references to super-resolve the texture map of the model.
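The idea of selecting the most similar texture from several references can be illustrated with a toy sketch. This is not the network proposed in the thesis: it assumes patch features are already available as plain vectors and uses a hard nearest-match rule based on cosine similarity in place of learned attention. The names `cosine` and `transfer_from_references` are hypothetical and introduced only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def transfer_from_references(lr_patches, references):
    """For each low-resolution patch feature, find the most similar patch
    feature across ALL reference images (a hard-attention stand-in).

    lr_patches: list of feature vectors, one per low-resolution patch.
    references: list of reference images, each a list of patch feature vectors.
    Returns one (reference_index, patch_index, similarity) tuple per patch.
    """
    matches = []
    for query in lr_patches:
        best = (-1, -1, -2.0)  # similarity is always >= -1, so this is replaced
        for ref_idx, ref in enumerate(references):
            for patch_idx, key in enumerate(ref):
                score = cosine(query, key)
                if score > best[2]:
                    best = (ref_idx, patch_idx, score)
        matches.append(best)
    return matches


if __name__ == "__main__":
    lr = [[1.0, 0.0], [0.0, 1.0]]
    refs = [
        [[0.9, 0.1], [-1.0, 0.0]],  # reference image 0
        [[0.0, 2.0], [1.0, 1.0]],   # reference image 1
    ]
    for ref_idx, patch_idx, score in transfer_from_references(lr, refs):
        print(ref_idx, patch_idx, round(score, 3))
```

In this sketch each low-resolution patch may draw its texture from a different reference image, which is the property that lets views of different sides of a subject contribute to one texture map.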
The third contribution is a novel super-resolution human shape, introduced to represent high-quality details in the 3D shape reconstructed from a single low-resolution image or from an image captured in a large volume. A novel framework learns a high-detail implicit function to represent the reconstructed shape, so that a high-quality 3D shape can be recovered from a single low-resolution image.
The fourth contribution is a new method that reconstructs accurate full-body human shapes from single-view RGB-D images. The benefit of depth observations is investigated with an approach that takes a single RGB-D image as input. The introduced framework is built on a neural implicit representation and proposes a data-driven strategy to learn accurate geometric details from both multi-resolution pixel-aligned and voxel-aligned features.
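As a rough illustration of what a pixel-aligned input looks like for a 3D query point, the sketch below projects the point onto a depth map, samples it bilinearly, and computes a truncated offset between the point's depth and the observed depth. It is a minimal sketch under stated assumptions, not the thesis framework: the projection is assumed orthographic, the truncation value is arbitrary, and the learned network that would consume these values together with image and voxel features is omitted. The names `bilinear_sample` and `pixel_aligned_query` are hypothetical.

```python
def bilinear_sample(grid, u, v):
    """Bilinearly sample a 2D grid (list of rows) at continuous pixel
    coordinates, where u is the column and v is the row. Coordinates
    outside the grid are clamped to the border."""
    height, width = len(grid), len(grid[0])
    u = min(max(u, 0.0), width - 1.0)
    v = min(max(v, 0.0), height - 1.0)
    x0, y0 = int(u), int(v)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = u - x0, v - y0
    top = grid[y0][x0] * (1.0 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1.0 - fx) + grid[y1][x1] * fx
    return top * (1.0 - fy) + bottom * fy


def pixel_aligned_query(point, depth_map, truncation=0.1):
    """Build a pixel-aligned input for a 3D query point (x, y, z).

    Assumes an orthographic camera where x maps to the column and y to
    the row. Returns the depth observed at the projected pixel and the
    truncated offset z - depth (positive when the point lies behind the
    observed surface)."""
    x, y, z = point
    observed = bilinear_sample(depth_map, x, y)
    offset = max(-truncation, min(truncation, z - observed))
    return observed, offset


if __name__ == "__main__":
    depth = [[1.0, 2.0],
             [3.0, 4.0]]
    print(pixel_aligned_query((0.5, 0.5, 2.55), depth))
    print(pixel_aligned_query((0.0, 0.0, 5.0), depth))
```

In an implicit reconstruction pipeline, values like these would typically be concatenated with sampled image features and passed to a learned function that predicts occupancy or signed distance at the query point.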
The final contribution of this thesis introduces a novel framework for reconstructing a complete, high-quality 3D human shape from a single image by leveraging a collection of monocular unconstrained images. A single image alone does not provide sufficient information to accurately reproduce details in body regions that are not visible in the input. Rather than relying on multi-view capture systems, which are not widely accessible, to acquire multiple views of an individual, uncalibrated and unregistered images of the subject are leveraged. A novel module processes these reference images to simulate a multi-view scenario by generating 2D normal maps of the individual in the same pose as the target input. A multi-view transformer-based neural implicit model then estimates the implicit representation of the complete, high-quality 3D human shape from the generated normal maps.