Abstract
We present a novel method to learn temporally consistent 3D reconstruction of
clothed people from a monocular video. Recent methods for 3D human
reconstruction from monocular video, using volumetric, implicit, or parametric
human shape models, produce per-frame reconstructions, resulting in temporally
inconsistent output and limited performance when applied to video. In this
paper, we introduce an approach to learn temporally consistent features for
textured reconstruction of clothed 3D human sequences from monocular video by
proposing two advances: a novel temporal consistency loss function, and hybrid
representation learning for implicit 3D reconstruction from 2D images and
coarse 3D geometry. The proposed advances improve the temporal consistency and
accuracy of both the 3D reconstruction and texture prediction from a monocular
video. Comprehensive comparative evaluation on images of people demonstrates
that the proposed method significantly outperforms state-of-the-art
learning-based single-image 3D human shape estimation approaches, achieving
substantial improvements in reconstruction accuracy, completeness, quality, and
temporal consistency.
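
The abstract names, but does not define, the temporal consistency loss. The sketch below illustrates one plausible form, assuming the network produces per-frame implicit occupancy predictions at shared 3D query points and that the loss penalizes differences between adjacent frames; the function name, tensor shapes, and L1 formulation are assumptions for illustration, not the paper's actual loss.

```python
# Illustrative sketch only: assumes per-frame implicit occupancy predictions
# evaluated at the same 3D query points in consecutive frames; the L1 penalty
# and all names/shapes here are hypothetical, not the paper's formulation.
import torch


def temporal_consistency_loss(occ_t: torch.Tensor, occ_t1: torch.Tensor) -> torch.Tensor:
    """L1 penalty between occupancy predictions of frames t and t+1.

    occ_t, occ_t1: (B, N) occupancy values in [0, 1] for the same N query
    points evaluated in both frames (point correspondence is assumed).
    """
    return torch.mean(torch.abs(occ_t - occ_t1))


# Usage example with random stand-in predictions.
if __name__ == "__main__":
    occ_t = torch.rand(2, 4096)   # frame t predictions
    occ_t1 = torch.rand(2, 4096)  # frame t+1 predictions
    print(temporal_consistency_loss(occ_t, occ_t1).item())
```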