Abstract
Human tracking and 3D pose estimation are two core activities of computer vision: the former identifies
and follows an individual within a scene, while the latter produces a three-dimensional estimate of an individual's body pose and configuration. This
combination of processes can be applied to scenarios such as entertainment production, home
health monitoring or sports analysis, where there is a relatively low person count, and
a requirement to know more than just the approximate three-dimensional position of a
person. However, such systems generally require non-complementary camera configurations,
so two separate but overlapping camera rigs must be established.
An ideal solution would be to combine our camera configurations, something which can
easily be achieved using wide-angle, panoramic or 360° cameras. Through careful placement
of these cameras, we simultaneously view the entire scene, and also produce multiple views
of an individual in order to inform our pose estimate. However, such cameras bring their
own representation problems, hampering the performance of existing solutions, or preventing
them from operating entirely. Therefore, we explore this facet of the problem, producing
tracking and pose estimation solutions that natively function from 360° imagery.
To facilitate this, we first contribute a tracker and pose estimation system, operating
from a pair of horizontally disjoint 360° cameras. We use provided person segmentation
masks to create descriptors suitable for use at differing resolutions, while the specific
camera configuration allows us to share these descriptors, combining them with spatial
information to track an individual regardless of their distance from either camera. With a
person isolated, we then create a joint-wise pose estimate directly from the spherical coordinate
space, eliminating the need for either reprojection operations, or intrinsic calibration
information to be provided.
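
As a rough illustration of why working in spherical coordinates can remove the need for intrinsic calibration, the sketch below maps a pixel to a viewing direction using only the image size. It assumes an equirectangular image layout, which the text above does not state, and the function names pixel_to_spherical and spherical_to_ray are hypothetical, chosen for this example only.

```python
import math

def pixel_to_spherical(u, v, width, height):
    """Map an equirectangular pixel (u, v) to (longitude, latitude) in radians.

    Assumption: the image spans 360 degrees horizontally and 180 degrees
    vertically, with the image centre looking along longitude 0, latitude 0.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi    # -pi at left edge, +pi at right
    lat = math.pi / 2.0 - (v / height) * math.pi   # +pi/2 at top, -pi/2 at bottom
    return lon, lat

def spherical_to_ray(lon, lat):
    """Convert (longitude, latitude) to a unit direction vector (x, y, z), y up."""
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

# Usage: the centre pixel of a 2000 x 1000 image looks straight ahead.
lon, lat = pixel_to_spherical(1000, 500, 2000, 1000)
ray = spherical_to_ray(lon, lat)
```

No focal length or principal point appears anywhere in the mapping; only the image dimensions are needed, under the layout assumed here.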
Our second contribution reconfigures these cameras to a low, vertical baseline configuration.
We simultaneously track each individual in the scene using only two-dimensional joint
location estimates, exploiting the camera arrangement to assume an epipolar relationship. A
temporally consistent 3D human pose estimate is then constructed, first as a coarse Principal Component Analysis (PCA) model, then refined in a joint-wise fashion over successive
iterations, smoothing out any unrealistic jumps in motion.
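
One way to picture the epipolar relationship offered by a vertical baseline is sketched below. It assumes two axis-aligned 360° cameras stacked directly above one another with a known baseline, so that a joint appears at the same longitude in both views and differs only in elevation angle. The function triangulate_vertical and its parameters are hypothetical and are not taken from the method described above.

```python
import math

def triangulate_vertical(lon, lat_low, lat_up, baseline):
    """Recover a 3D point from its elevation angles in two stacked cameras.

    Assumptions: both cameras share a vertical axis, the upper camera sits
    `baseline` metres above the lower one, and angles are in radians.
    Returns (x, y, z) relative to the lower camera, with y pointing up.
    """
    # tan(lat_low) = h / d and tan(lat_up) = (h - baseline) / d,
    # so their difference equals baseline / d.
    denom = math.tan(lat_low) - math.tan(lat_up)
    if abs(denom) < 1e-9:
        raise ValueError("rays are near-parallel; depth cannot be recovered")
    d = baseline / denom            # horizontal distance from the camera axis
    h = d * math.tan(lat_low)       # height relative to the lower camera
    return d * math.sin(lon), h, d * math.cos(lon)

# Usage: a joint 2 m away and 1 m above the lower camera, baseline 0.5 m.
point = triangulate_vertical(0.0, math.atan2(1.0, 2.0), math.atan2(0.5, 2.0), 0.5)
```

The sketch becomes unstable for joints almost directly above or below the cameras, where the tangent terms grow without bound, and for distant joints, where the two elevation angles are nearly equal.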
Having established tracking in a local area, our final contribution moves beyond the confines
of a single room, and tracks individuals as they move throughout a scene composed of
multiple rooms or regions. We perform this with no prior knowledge of the scene layout or
content, and use only camera extrinsics and person movements to iteratively build tracks for
each individual simultaneously, with each stage informing the next.
Overall, we demonstrate that 360° imagery presents many advantages that can be utilised
or exploited in both human tracking and three-dimensional human pose estimation.
We enable tracking in a variety of situations where traditional methods are impractical
or impossible, and position methods to provide training data for the next generation of
multi-camera, 360°-capable, deep-learning-based tracking approaches. We also produce pose
estimates that bridge the gap between multi-view systems and monocular systems.