Abstract
In the multi-view domain, correctly labelling multiple people across viewpoints is challenging due to occlusions, visual ambiguities, and appearance variation. Although deep learning has achieved remarkable success in computer vision tasks, it remains underexplored for multi-view labelling because of the shortage of labelled multi-view datasets. In this paper, we propose a novel end-to-end deep neural network, the Multi-View Labelling network (MVL-net), to address this problem. To overcome the dataset shortage, we generate a large-scale synthetic multi-view dataset by combining 3D human models, diverse human poses, panoramic backgrounds, and realistic rendering. In the proposed MVL-net, we first incorporate Transformer blocks to capture non-local information during multi-view feature extraction. A matching net then labels multiple people by predicting matching confidence scores for pairwise instances from two views, thereby handling the unknown number of people when labelling across views. An additional geometry feature derived from epipolar geometry is integrated to leverage multi-view cues during training. To the best of our knowledge, MVL-net is the first deep-learning-based network for multi-view labelling. Comprehensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed method, which outperforms existing state-of-the-art approaches.