IEEE Access (Jan 2024)
Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration
Abstract
Self-supervised monocular depth estimation is a promising research area because models can be trained without expensive, difficult-to-obtain ground-truth depth labels. Models in this domain commonly employ Convolutional Neural Networks (CNNs) and Transformers for feature extraction. CNNs excel at capturing local features but struggle with global information due to their limited receptive field, whereas Transformers capture global features but are computationally expensive. To balance performance and computational efficiency, this paper proposes a lightweight self-supervised monocular depth estimation model that integrates CNN and Transformer architectures. The model introduces an Agent Attention mechanism to model global context effectively while significantly reducing computational complexity. Furthermore, spatial and channel restructured convolution is used to reduce the computational cost of redundant feature extraction in visual tasks. Validation on the KITTI dataset shows that the model reaches an Absolute Relative Error of 0.104 and a Squared Relative Error of 0.757 while keeping the number of parameters nearly unchanged; accuracy improves to 0.889, computational complexity drops to 4.993 GFLOPs, and training time falls from 15.5 hours to 13.5 hours. The model also generalizes well to the Make3D dataset. With only 3.0M parameters and low computational complexity, it is well suited to resource-constrained devices.
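The following is a minimal sketch of the agent-attention idea referred to above, not the paper's implementation: tensor layout, the pooling used to form agent tokens, and the num_agents value are assumptions for illustration. A small set of pooled agent tokens first aggregates context from all keys and values, and the queries then attend only to those agents, so the cost falls from O(N^2) to roughly O(Nn) for n agents.

```python
# Illustrative sketch of agent attention (assumed formulation, not the paper's code).
import torch
import torch.nn.functional as F


def agent_attention(q, k, v, num_agents=49):
    """q, k, v: (batch, tokens, dim). num_agents is a hypothetical default."""
    b, n, d = q.shape
    scale = d ** -0.5
    # Form a small set of agent tokens by pooling the queries along the token axis.
    agents = F.adaptive_avg_pool1d(q.transpose(1, 2), num_agents).transpose(1, 2)
    # Agent aggregation: agents gather global context from all keys/values, O(n * N).
    agent_ctx = F.softmax(agents @ k.transpose(1, 2) * scale, dim=-1) @ v
    # Agent broadcast: every query reads the aggregated context back from the agents.
    out = F.softmax(q @ agents.transpose(1, 2) * scale, dim=-1) @ agent_ctx
    return out  # (batch, tokens, dim)


if __name__ == "__main__":
    x = torch.randn(2, 1024, 64)
    print(agent_attention(x, x, x).shape)  # torch.Size([2, 1024, 64])
```

The quadratic token-token attention is replaced by two smaller softmax attentions through the agent tokens, which is what allows global context to be modeled at a much lower FLOP count.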
Keywords