Abstract
We propose a multi-view framework for joint object detection and labelling based on pairs of images. The proposed framework extends the single-view Mask R-CNN approach to multiple views without the need for additional training. Dedicated components are embedded into the framework to match objects across views by enforcing epipolar constraints, appearance feature similarity and class coherence. The multi-view extension enables the proposed framework to detect objects that would otherwise be mis-detected by a classical Mask R-CNN approach, and achieves coherent object labelling across views. By avoiding the need for additional training, the approach effectively overcomes the current shortage of multi-view datasets. The proposed framework achieves high-quality results on a range of complex scenes, outputting for each object a class, bounding box, mask and an additional label that enforces coherence across views. In the evaluation, we show qualitative and quantitative results on several challenging outdoor multi-view datasets and perform a comprehensive comparison to verify the advantages of the proposed method.