Abstract
In recent years, there has been growing interest in developing personalized human avatars for applications ranging from virtual reality and gaming to movie production and social telepresence. In the future, these avatars will be expected to interact with everyday objects as well. Achieving this requires not only accurate human reconstruction, but also a joint understanding of the surrounding objects and their interactions, making joint 3D human-object reconstruction a key problem. Existing methods that jointly reconstruct 3D humans and objects from a single RGB image produce only coarse or template-based shapes, and thus fail to capture realistic details such as loose clothing on the human body. In this work, we propose, for the first time, an approach that jointly reconstructs 3D clothed humans and objects from a monocular image of a human-object scene.
At the core of our framework is a novel attention-based model that jointly learns an implicit function for the human and the object. Given a query point, our model utilizes pixel-aligned features from the input human-object image as well as from separate, non-occluded views of the human and the object synthesized by a diffusion model. This allows the model to reason about human-object spatial relationships and to recover details from both visible and occluded regions, enabling realistic reconstruction. To guide the reconstruction, we condition the neural implicit model on human-object pose estimation priors. To support training and evaluation, we also introduce a synthetic human-object dataset. We demonstrate on real-world datasets that our approach significantly improves the perceptual quality of 3D human-object reconstruction.
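The abstract only outlines the architecture. Below is a minimal PyTorch sketch of how such an attention-fused, pixel-aligned implicit query might be structured, assuming per-view feature maps, a pose-prior vector, and an occupancy-style output; all class and argument names (e.g. PixelAlignedImplicitModel, feats_scene, pose_prior) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class PixelAlignedImplicitModel(nn.Module):
    """Toy sketch (not the paper's model): fuses pixel-aligned features from
    the input view and synthesized human/object views with attention, then
    decodes an occupancy value for each 3D query point."""

    def __init__(self, feat_dim=256, pose_dim=72, hidden=256):
        super().__init__()
        # Attention over the three per-point view tokens (scene, human, object).
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # MLP decodes fused features + query depth + pose prior into occupancy.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1 + pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    @staticmethod
    def sample_pixel_aligned(feat_map, xy):
        # feat_map: (B, C, H, W); xy: (B, N, 2) projected coords in [-1, 1].
        grid = xy.unsqueeze(2)                                   # (B, N, 1, 2)
        samp = torch.nn.functional.grid_sample(
            feat_map, grid, align_corners=True)                  # (B, C, N, 1)
        return samp.squeeze(-1).permute(0, 2, 1)                 # (B, N, C)

    def forward(self, feats_scene, feats_human, feats_object, xy, z, pose_prior):
        # Pixel-aligned features for each query point, from each view.
        tokens = torch.stack([
            self.sample_pixel_aligned(feats_scene, xy),
            self.sample_pixel_aligned(feats_human, xy),
            self.sample_pixel_aligned(feats_object, xy),
        ], dim=2)                                                # (B, N, 3, C)
        B, N, V, C = tokens.shape
        tokens = tokens.reshape(B * N, V, C)
        fused, _ = self.attn(tokens, tokens, tokens)             # self-attention over views
        fused = fused.mean(dim=1).reshape(B, N, C)
        pose = pose_prior.unsqueeze(1).expand(B, N, -1)          # broadcast pose prior
        return self.mlp(torch.cat([fused, z.unsqueeze(-1), pose], dim=-1))  # (B, N, 1)
```

In this sketch the synthesized human and object views enter only through their feature maps, and the pose-estimation prior is a simple conditioning vector; the actual model may fuse these signals differently.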