Abstract
Monocular depth estimation has become one of the most studied applications in
computer vision, where the most accurate approaches are based on fully
supervised learning models. However, the acquisition of large and accurate
ground truth data sets to train these fully supervised methods is a major
challenge for the further development of the area. Self-supervised methods
trained with monocular videos constitute one of the most promising approaches to
mitigate this challenge due to the widespread availability of
training data. Consequently, they have been intensively studied, with the main
ideas explored including different types of model architectures, loss
functions, and occlusion masks to address non-rigid motion. In this paper, we
propose two new ideas to improve self-supervised monocular trained depth
estimation: 1) self-attention, and 2) discrete disparity prediction. Compared
with the usual localised convolution operation, self-attention can explore
more general contextual information that allows the inference of similar
disparity values at non-contiguous regions of the image. Discrete disparity
prediction has been shown by fully supervised methods to provide a more robust
and sharper depth estimation than the more common continuous disparity
prediction, besides enabling the estimation of depth uncertainty. We show that
the extension of the state-of-the-art self-supervised monocular trained depth
estimator Monodepth2 with these two ideas allows us to design a model that
produces the best results in the field on KITTI 2015 and Make3D, closing the
gap with respect to self-supervised stereo training and fully supervised
approaches.
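
As an illustration of how discrete disparity prediction can yield both a
disparity value and an uncertainty estimate, the following is a minimal sketch
for a single pixel. It is not taken from the paper: it assumes one common
formulation in which the network outputs one logit per disparity level, the
disparity is the expectation under the softmax distribution, and the variance
of that distribution serves as the uncertainty. The function names, the
uniform spacing of levels, and the range [d_min, d_max] are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discrete_disparity(logits, d_min=0.0, d_max=1.0):
    """Return (expected disparity, variance) for one pixel.

    Assumption: the K disparity levels are spaced uniformly in
    [d_min, d_max]; the paper's abstract does not specify the spacing.
    """
    k = len(logits)
    if k < 2:
        raise ValueError("need at least two disparity levels")
    levels = [d_min + (d_max - d_min) * i / (k - 1) for i in range(k)]
    probs = softmax(logits)
    # Expectation over the discrete levels gives a sub-level estimate.
    mean = sum(p * d for p, d in zip(probs, levels))
    # Variance of the same distribution is used here as the uncertainty.
    var = sum(p * (d - mean) ** 2 for p, d in zip(probs, levels))
    return mean, var


# A flat distribution and a sharply peaked one share the same mean here
# but differ in variance, i.e. in the uncertainty attached to the estimate.
flat_mean, flat_var = discrete_disparity([0.0, 0.0, 0.0])
peak_mean, peak_var = discrete_disparity([0.0, 50.0, 0.0])
```

Under these assumptions, a pixel whose probability mass is spread over many
levels receives a high variance, while a confidently predicted pixel receives
a variance close to zero.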