Sensors (Dec 2024)
Residual Vision Transformer and Adaptive Fusion Autoencoders for Monocular Depth Estimation
Abstract
Precise depth estimation plays a key role in many applications, including 3D scene reconstruction, virtual reality, autonomous driving, and human–computer interaction. With recent advances in deep learning, monocular depth estimation, owing to its simplicity, has surpassed traditional stereo camera systems and opened new possibilities in 3D sensing. In this paper, using a single camera, we propose an end-to-end supervised monocular depth estimation autoencoder that combines an encoder mixing a convolutional neural network with vision transformers and an effective adaptive fusion decoder to obtain high-precision depth maps. In the encoder, we construct a multi-scale feature extractor by mixing residual configurations of vision transformers to enhance both local and global information. In the adaptive fusion decoder, we introduce adaptive fusion modules to effectively merge the features of the encoder and the decoder. Finally, the model is trained with a loss function aligned with human perception so that it focuses on the depth values of foreground objects. The experimental results demonstrate that the proposed autoencoder effectively predicts the depth map from a single-view color image, improving the first accuracy rate by about 28% and reducing the root mean square error by about 27% compared with an existing method on the NYU dataset.
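To make the overall pipeline concrete, the following is a minimal PyTorch-style sketch of a hybrid CNN/vision-transformer encoder feeding an adaptive fusion decoder that outputs a one-channel depth map. The module names (e.g., ResidualViTBlock, AdaptiveFusion), channel sizes, and the gating form of the fusion are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: hybrid CNN + transformer encoder with an
# adaptive fusion decoder for monocular depth estimation. All module
# names, channel widths, and the gating scheme are assumptions.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """3x3 conv -> BN -> ReLU, used for local feature extraction."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class ResidualViTBlock(nn.Module):
    """Transformer layer over flattened feature maps, added residually
    so local CNN features and global attention are both kept."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads,
            dim_feedforward=channels * 2, batch_first=True)

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, HW, C)
        tokens = self.attn(tokens)                        # global context
        return x + tokens.transpose(1, 2).reshape(b, c, h, w)


class AdaptiveFusion(nn.Module):
    """Learned per-pixel gate that merges an encoder skip feature with
    the upsampled decoder feature (one assumed form of adaptive fusion)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels * 2, channels, 1), nn.Sigmoid())
        self.merge = ConvBlock(channels, channels)

    def forward(self, skip, up):
        g = self.gate(torch.cat([skip, up], dim=1))       # fusion weights
        return self.merge(g * skip + (1 - g) * up)


class DepthAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = ConvBlock(3, 32, stride=2)             # 1/2 resolution
        self.enc2 = nn.Sequential(ConvBlock(32, 64, stride=2),
                                  ResidualViTBlock(64))    # 1/4, local+global
        self.enc3 = nn.Sequential(ConvBlock(64, 128, stride=2),
                                  ResidualViTBlock(128))   # 1/8
        self.up3 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.fuse2 = AdaptiveFusion(64)
        self.up2 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.fuse1 = AdaptiveFusion(32)
        self.head = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1))                # depth map

    def forward(self, x):
        f1 = self.enc1(x)
        f2 = self.enc2(f1)
        f3 = self.enc3(f2)
        d2 = self.fuse2(f2, self.up3(f3))                  # merge skip + decoder
        d1 = self.fuse1(f1, self.up2(d2))
        return self.head(d1)


if __name__ == "__main__":
    depth = DepthAutoencoder()(torch.randn(1, 3, 128, 128))
    print(depth.shape)  # torch.Size([1, 1, 128, 128])
```

The gated sum in AdaptiveFusion stands in for the paper's adaptive fusion modules: the network learns, per pixel, how much to trust the encoder skip connection versus the upsampled decoder feature before refining the merged result.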
Keywords