Charles Duncan Malleson

Research Fellow B, School of Computer Science & Electronic Engineering, Faculty of Engineering and Physical Sciences, University of Surrey

Journal article

Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data

by Moira Shooter, Charles Malleson and Adrian Hilton

Posted to a preprint site 20/06/2024

We introduce a new benchmark analysis focusing on 3D canine pose estimation from monocular in-the-wild images. A multi-modal dataset, 3DDogs-Lab, was captured indoors, featuring various dog breeds trotting on a walkway. It includes data from optical marker-based mocap systems, RGBD cameras, IMUs, and a pressure mat. While providing high-quality motion data, the presence of optical markers and limited background diversity make the captured video less representative of real-world conditions. To address this, we created 3DDogs-Wild, a naturalised version of the dataset where the optical markers are in-painted and the subjects are placed in diverse environments, enhancing its utility for training RGB image-based pose detectors. We show that training the models on 3DDogs-Wild leads to improved performance when evaluating on in-the-wild data. Additionally, we provide a thorough analysis using various pose estimation models, revealing their respective strengths and weaknesses. We believe that our findings, coupled with the datasets provided, offer valuable insights for advancing 3D animal pose estimation.

Journal article Peer reviewed

Wearable apparatus for correction of visual alignment under torsional strabismus

by Charles Malleson and Jean-Yves Guillemaut

Published 04/2024

Optometry and Vision Science, 101, 4, 204 - 210

SIGNIFICANCE

A wearable optical apparatus that compensates for eye misalignment (strabismus) to correct for double vision (diplopia) is proposed. In contrast to prism lenses, commonly used to compensate for horizontal and/or vertical misalignment, the proposed approach is able to compensate for any combination of horizontal, vertical, and torsional misalignment.

PURPOSE

If the action of the extraocular muscles is compromised (e.g., by nerve damage), a patient may lose their ability to maintain visual alignment, negatively affecting their binocular fusion and stereo depth perception capability. Torsional misalignment cannot be mitigated by standard Fresnel prism lenses. Surgical procedures intended to correct torsional misalignment may be unpredictable. A wearable device able to rectify visual alignment and restore stereo depth perception without surgical intervention could potentially be of great value to people with strabismus.

METHODS

We propose a novel lightweight wearable optical device for visual alignment correction. The device comprises two mirrors and a Fresnel prism, arranged in such a way that together they rotationally shift the view seen by the affected eye horizontally, vertically, and torsionally. The extent of the alignment correction on each axis can be arbitrarily adjusted according to the patient's particular misalignment characteristics.

RESULTS

The proposed approach was tested by computer simulation, and a prototype device was manufactured. The prototype device was tested by a strabismus patient exhibiting horizontal and torsional misalignment. In these tests, the device was found to function as intended, allowing the patient to enjoy binocular fusion and stereo depth perception while wearing the device for daily activities over a period of several months.

CONCLUSIONS

The proposed device is effective in correcting arbitrary horizontal, vertical, and torsional misalignment of the eyes. The results of the initial testing performed are highly encouraging. Future study is warranted to formally assess the effectiveness of the device on multiple test patients.

Journal article Peer reviewed

Real-Time Multi-person Motion Capture from Multi-view Video and IMUs

by Charles Malleson, John Collomosse and Adrian Hilton

First online publication 17/12/2019

International Journal of Computer Vision

A real-time motion capture system is presented which uses input from multiple standard video cameras and inertial measurement units (IMUs). The system is able to track multiple people simultaneously and requires no optical markers, specialized infra-red cameras or foreground/background segmentation, making it applicable to general indoor and outdoor scenarios with dynamic backgrounds and lighting. To overcome limitations of prior video-only or IMU-only approaches, we propose to use flexible combinations of multiple-view, calibrated video and IMU input along with a pose prior in an online optimization-based framework, which allows the full 6-DoF motion to be recovered including axial rotation of limbs and drift-free global position. A method for sorting and assigning raw input 2D keypoint detections into corresponding subjects is presented which facilitates multi-person tracking and rejection of any bystanders in the scene. The approach is evaluated on data from several indoor and outdoor capture environments with one or more subjects and the trade-off between input sparsity and tracking performance is discussed. State-of-the-art pose estimation performance is obtained on the Total Capture (multi-view video and IMU) and Human 3.6M (multi-view video) datasets. Finally, a live demonstrator for the approach is presented showing real-time capture, solving and character animation using a light-weight, commodity hardware setup.

Book chapter

3D Reconstruction from RGB-D Data

by Charles Malleson, Jean-Yves Guillemaut and Adrian Hilton

First online publication 27/10/2019

RGB-D Image Analysis and Processing, pp 87 - 115

A key task in computer vision is that of generating virtual 3D models of real-world scenes by reconstructing the shape, appearance and, in the case of dynamic scenes, motion of the scene from visual sensors. Recently, low-cost video plus depth (RGB-D) sensors have become widely available and have been applied to 3D reconstruction of both static and dynamic scenes. RGB-D sensors contain an active depth sensor, which provides a stream of depth maps alongside standard colour video. The low cost and ease of use of RGB-D devices as well as their video rate capture of images along with depth make them well suited to 3D reconstruction. Use of active depth capture overcomes some of the limitations of passive monocular or multiple-view video-based approaches since reliable, metrically accurate estimates of the scene depth at each pixel can be obtained from a single view, even in scenes that lack distinctive texture. There are two key components to 3D reconstruction from RGB-D data: (1) spatial alignment of the surface over time, and (2) fusion of noisy, partial surface measurements into a more complete, consistent 3D model. In the case of static scenes, the sensor is typically moved around the scene and its pose is estimated over time. For dynamic scenes, there may be multiple rigid, articulated, or non-rigidly deforming surfaces to be tracked over time. The fusion component consists of integration of the aligned surface measurements, typically using an intermediate representation, such as the volumetric truncated signed distance field (TSDF). In this chapter, we discuss key recent approaches to 3D reconstruction from depth or RGB-D input, with an emphasis on real-time reconstruction of static scenes.
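
As an illustration of the fusion component described above, the following is a minimal sketch of a per-voxel TSDF update using a weighted running average. It is not taken from the chapter: the function name `fuse_tsdf`, the truncation distance `trunc`, the weight cap `max_weight`, and the convention that unobserved voxels are marked with NaN are all assumptions made for this example, and the alignment step (projecting voxels into the current depth map) is assumed to have been done already.

```python
import numpy as np


def fuse_tsdf(tsdf, weight, sdf_obs, trunc=0.05, max_weight=64.0):
    """Fuse one frame of signed-distance observations into a running TSDF.

    tsdf, weight: current per-voxel TSDF values in [-1, 1] and accumulated weights.
    sdf_obs: per-voxel signed distance to the observed surface (metres),
             NaN where the voxel was not observed in this frame (assumed convention).
    """
    # Replace unobserved entries with -inf so they fail the validity test below.
    safe = np.where(np.isfinite(sdf_obs), sdf_obs, -np.inf)
    # Skip voxels that lie further than the truncation distance behind the surface.
    valid = safe > -trunc
    # Truncate and normalise the observed distance to [-1, 1].
    d = np.clip(safe / trunc, -1.0, 1.0)
    # Each valid observation contributes a weight of 1.
    new_weight = weight + valid
    # Weighted running average where observed; keep the old value elsewhere.
    fused = np.where(valid, (tsdf * weight + d) / np.maximum(new_weight, 1.0), tsdf)
    return fused, np.minimum(new_weight, max_weight)


# Usage: fuse two noisy observations of three voxels.
tsdf = np.zeros(3)
weight = np.zeros(3)
tsdf, weight = fuse_tsdf(tsdf, weight, np.array([0.025, np.nan, -0.2]))
tsdf, weight = fuse_tsdf(tsdf, weight, np.array([-0.025, 0.1, np.nan]))
```

Averaging over frames is what lets noisy, partial measurements converge towards a consistent surface, which can then be extracted at the zero crossing of the field.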

Journal article Open access Peer reviewed

Fusing Visual and Inertial Sensors with Semantics for 3D Human Pose Estimation

by Andrew Gilbert, Matthew Trumble, Charles Malleson, Adrian Hilton and John Collomosse

Published 08/09/2018

International Journal of Computer Vision

We propose an approach to accurately estimate 3D human pose by fusing multi-viewpoint video (MVV) with inertial measurement unit (IMU) sensor data, without optical markers, a complex hardware setup or a full body model. Uniquely, we use a multi-channel 3D convolutional neural network to learn a pose embedding from visual occupancy and semantic 2D pose estimates from the MVV in a discretised volumetric probabilistic visual hull (PVH). The learnt pose stream is concurrently processed with a forward kinematic solve of the IMU data, and a temporal model (LSTM) exploits the rich spatial and temporal long-range dependencies among the solved joints; the two streams are then fused in a final fully connected layer. The two complementary data sources allow for ambiguities to be resolved within each sensor modality, yielding improved accuracy over prior methods. Extensive evaluation is performed, with state-of-the-art performance reported on the popular Human 3.6M dataset [26], the newly released TotalCapture dataset and a challenging set of outdoor videos, TotalCaptureOutdoor. We release the new hybrid MVV dataset (TotalCapture), comprising multi-viewpoint video, IMU and accurate 3D skeletal joint ground truth derived from a commercial motion capture system. The dataset is available online at http://cvssp.org/data/totalcapture/.

Journal article Open access Peer reviewed

Hybrid modelling of non-rigid scenes from RGBD cameras

by Charles Malleson, Jean-Yves Guillemaut and Adrian Hilton

Published 03/08/2018

IEEE Transactions on Circuits and Systems for Video Technology

Recent advances in sensor technology have introduced low-cost RGB video plus depth sensors, such as the Kinect, which enable simultaneous acquisition of colour and depth images at video rates. This paper introduces a framework for representation of general dynamic scenes from video plus depth acquisition. A hybrid representation is proposed which combines the advantages of prior surfel graph surface segmentation and modelling work with the higher-resolution surface reconstruction capability of volumetric fusion techniques. The contributions are (1) extension of a prior piecewise surfel graph modelling approach for improved accuracy and completeness, (2) combination of this surfel graph modelling with TSDF surface fusion to generate dense geometry, and (3) proposal of means for validation of the reconstructed 4D scene model against the input data and efficient storage of any unmodelled regions via residual depth maps. The approach allows arbitrary dynamic scenes to be efficiently represented with temporally consistent structure and enhanced levels of detail and completeness where possible, but gracefully falls back to raw measurements where no structure can be inferred. The representation is shown to facilitate creative manipulation of real scene data which would previously require more complex capture setups or manual processing.

Conference proceeding

Real-time Full-Body Motion Capture from Video and IMUs

by Charles Malleson, Marco Volino, Andrew Gilbert, Matthew Trumble, John Collomosse and Adrian Hilton

Published 01/01/2017

PROCEEDINGS 2017 INTERNATIONAL CONFERENCE ON 3D VISION (3DV), 449 - 457

A real-time full-body motion capture system is presented which uses input from a sparse set of inertial measurement units (IMUs) along with images from two or more standard video cameras and requires no optical markers or specialized infra-red cameras. A real-time optimization-based framework is proposed which incorporates constraints from the IMUs, cameras and a prior pose model. The combination of video and IMU data allows the full 6-DOF motion to be recovered including axial rotation of limbs and drift-free global position. The approach was tested using both indoor and outdoor captured data. The results demonstrate the effectiveness of the approach for tracking a wide range of human motion in real time in unconstrained indoor/outdoor scenes.
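
An optimization-based framework of this kind can be pictured as minimising a weighted sum of cost terms over the pose parameters. The form below is only a generic sketch: the symbols $\theta$ (pose parameters), $E_{\text{IMU}}$, $E_{\text{cam}}$, $E_{\text{prior}}$ and the weights $\lambda$ are notation assumed for this illustration, not the paper's own.

```latex
\hat{\theta} = \arg\min_{\theta} \;
    \lambda_{\text{IMU}} \, E_{\text{IMU}}(\theta)
  + \lambda_{\text{cam}} \, E_{\text{cam}}(\theta)
  + \lambda_{\text{prior}} \, E_{\text{prior}}(\theta)
```

Here $E_{\text{IMU}}$ would penalise disagreement with the inertial measurements, $E_{\text{cam}}$ disagreement with the image observations, and $E_{\text{prior}}$ deviation from the prior pose model.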

Journal article Open access Peer reviewed

Virtual Volumetric Graphics on Commodity Displays using 3D Viewer Tracking

by C Malleson and J Collomosse

Published 01/02/2013

International Journal of Computer Vision (IJCV), 101, 3, 519 - 532

Three dimensional (3D) displays typically rely on stereo disparity, requiring specialized hardware to be worn or embedded in the display. We present a novel 3D graphics display system for volumetric scene visualization using only standard 2D display hardware and a pair of calibrated web cameras. Our computer vision-based system requires no worn or other special hardware. Rather than producing the depth illusion through disparity, we deliver a full volumetric 3D visualization - enabling users to interactively explore 3D scenes by varying their viewing position and angle according to the tracked 3D position of their face and eyes. We incorporate a novel wand-based calibration that allows the cameras to be placed at arbitrary positions and orientations relative to the display. The resulting system operates at real-time speeds (~25 fps) with low latency (120-225 ms) delivering a compelling natural user interface and immersive experience for 3D viewing. In addition to objective evaluation of display stability and responsiveness, we report on user trials comparing users' timings on a spatial orientation task.
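
To illustrate how a tracked viewer position can drive the rendering, the following is a minimal sketch of a standard off-axis (asymmetric) view frustum computation. It is not the paper's implementation: the function name `off_axis_frustum`, the coordinate convention (display centred at the origin in the z = 0 plane, viewer at positive z, units in metres) and the parameter names are assumptions made for this example.

```python
def off_axis_frustum(eye, screen_w, screen_h, near):
    """Return (left, right, bottom, top) frustum bounds at the near plane.

    eye: (x, y, z) position of the viewer relative to the display centre,
         with z > 0 being the distance in front of the display plane.
    screen_w, screen_h: physical width and height of the display.
    near: distance from the eye to the near clipping plane.
    """
    ex, ey, ez = eye
    if ez <= 0:
        raise ValueError("viewer must be in front of the display (z > 0)")
    # Scale display-plane extents back to the near plane by similar triangles.
    scale = near / ez
    left = (-screen_w / 2 - ex) * scale
    right = (screen_w / 2 - ex) * scale
    bottom = (-screen_h / 2 - ey) * scale
    top = (screen_h / 2 - ey) * scale
    return left, right, bottom, top


# Usage: a viewer 0.5 m from a 0.4 m x 0.3 m display, shifted 0.1 m to the right.
bounds = off_axis_frustum((0.1, 0.0, 0.5), 0.4, 0.3, 0.1)
```

As the viewer moves sideways the frustum becomes asymmetric, so the display behaves like a window onto the 3D scene rather than a fixed picture.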
