Encoder-Decoder Structure with Multiscale Receptive Field Block for Unsupervised Depth Estimation from Monocular Video

Songnan Chen; Junyu Han; Mengxia Tang; Ruifang Dong; Jiangming Kan

doi:10.3390/rs14122906

Remote Sensing (Jun 2022)

Affiliations

Songnan Chen: School of Mathematics and Computer Science, Wuhan Polytechnic University, No. 36 Huanhu Middle Road, Dongxihu District, Wuhan 430048, China
Junyu Han: School of Technology, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, China
Mengxia Tang: School of Technology, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, China
Ruifang Dong: School of Technology, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, China
Jiangming Kan: School of Technology, Beijing Forestry University, No. 35 Qinghua East Road, Haidian District, Beijing 100083, China

DOI: https://doi.org/10.3390/rs14122906
Journal volume & issue: Vol. 14, no. 12, p. 2906

Abstract

Monocular depth estimation is a fundamental yet challenging task in computer vision because depth information is lost when 3D scenes are projected onto 2D images. Although deep learning-based methods have considerably improved depth estimation from a single image, most existing approaches still fail to overcome this limitation. Supervised learning methods model depth estimation as a regression problem and therefore require large amounts of ground-truth depth data for training in real scenarios. Unsupervised learning methods treat depth estimation as the synthesis of a new disparity map, which means that rectified stereo image pairs are needed as the training dataset. To address these problems, we present an encoder-decoder-based framework that infers depth maps from monocular video snippets in an unsupervised manner. First, we design an unsupervised learning scheme for monocular depth estimation based on the basic principles of structure from motion (SfM); it uses only adjacent video clips, rather than paired training data, as supervision. Second, our method predicts two confidence masks that make the depth estimation model more robust to occlusion. Finally, we use the largest-scale and minimum depth loss instead of the multiscale and average loss to improve the accuracy of depth estimation. Experimental results on the KITTI depth estimation benchmark show that our method outperforms competing unsupervised methods.
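
The abstract gives no formulas, so the following is only a minimal sketch of two ideas it mentions, not the authors' implementation. It assumes the standard pinhole view-synthesis relation commonly used in SfM-based unsupervised training (a target pixel is back-projected with its predicted depth, moved by the relative camera pose, and re-projected into an adjacent frame), and it assumes that the "minimum" loss means taking the per-pixel minimum of the photometric error over the adjacent frames instead of their average. The function names `reproject` and `combine_errors`, and all numeric values, are hypothetical.

```python
import numpy as np


def reproject(pixels, depth, K, T):
    """Map target-frame pixels into an adjacent (source) frame.

    pixels: (N, 2) array of (u, v) target pixel coordinates
    depth:  (N,) predicted depth for each target pixel
    K:      (3, 3) camera intrinsics (assumed shared by both frames)
    T:      (4, 4) relative pose, target camera -> source camera
    """
    ones = np.ones((pixels.shape[0], 1))
    homog = np.hstack([pixels, ones])            # homogeneous pixels, (N, 3)
    rays = homog @ np.linalg.inv(K).T            # back-project to camera rays
    points = rays * depth[:, None]               # 3D points in the target camera
    points_h = np.hstack([points, ones])         # homogeneous 3D points, (N, 4)
    moved = (points_h @ T.T)[:, :3]              # 3D points in the source camera
    proj = moved @ K.T                           # project with the intrinsics
    return proj[:, :2] / proj[:, 2:3]            # perspective divide -> (N, 2)


def combine_errors(errors, mode="min"):
    """Reduce photometric errors of shape (S, N): S source frames, N pixels."""
    if mode == "min":
        # Per-pixel minimum over source frames, then average over pixels.
        return errors.min(axis=0).mean()
    # Plain average over all source frames and pixels.
    return errors.mean()


# Hypothetical intrinsics, pixels, depths, and a 1-unit shift along x.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 40.0],
              [0.0, 0.0, 1.0]])
pixels = np.array([[50.0, 40.0], [60.0, 45.0]])
depth = np.array([2.0, 5.0])
T = np.eye(4)
T[0, 3] = 1.0

source_pixels = reproject(pixels, depth, K, T)

# Hypothetical per-frame photometric errors for two source frames.
errors = np.array([[0.1, 0.9],
                   [0.5, 0.2]])
min_loss = combine_errors(errors, mode="min")
avg_loss = combine_errors(errors, mode="mean")
```

Under these assumptions, a pixel that is occluded in one adjacent frame but visible in the other contributes only its smaller error to the minimum variant, whereas the average variant is pulled up by the occluded view.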

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
