Lightweight monocular depth estimation using a fusion-improved transformer

Xin Sui; Song Gao; Aigong Xu; Cong Zhang; Changqiang Wang; Zhengxu Shi

doi:10.1038/s41598-024-72682-8

Scientific Reports (Sep 2024)

Affiliations

Xin Sui: School of Geomatics, Liaoning Technical University
Song Gao: School of Geomatics, Liaoning Technical University
Aigong Xu: School of Geomatics, Liaoning Technical University
Cong Zhang: School of Geomatics, Liaoning Technical University
Changqiang Wang: School of Geomatics, Liaoning Technical University
Zhengxu Shi: School of Geomatics, Liaoning Technical University

DOI: https://doi.org/10.1038/s41598-024-72682-8
Journal volume & issue: Vol. 14, no. 1
pp. 1 – 11

Abstract

Existing depth estimation networks often overlook computational efficiency while pursuing high accuracy. This paper proposes a lightweight self-supervised network that combines convolutional neural networks (CNNs) and transformers as the feature extraction and encoding layers for images, enabling the network to capture both local geometric and global semantic features for depth estimation. First, depthwise separable convolution is used to construct a dilated convolution residual module on a shallow network, enlarging the receptive field of shallow CNN feature extraction. In the transformer, a multi-head transposed attention module built on depthwise separable convolutions is proposed to reduce the computational burden of spatial self-attention. In the feedforward network, a two-step gating mechanism is proposed to improve its nonlinear representation ability. Finally, the CNN and transformer are integrated into a depth estimation network with local-global context interaction. Compared with other lightweight models, this model has fewer parameters and higher estimation accuracy, and it generalizes better across different outdoor datasets. Inference speed reaches 87 FPS, so the model balances real-time performance and estimation accuracy.
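
The efficiency arguments in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: the channel counts, kernel size, dilation rates and feature-map size are hypothetical values chosen only for illustration, and it assumes that "transposed attention" means attention computed across channels rather than across spatial positions. It compares the weight count of a standard convolution with a depthwise separable one, shows how dilation widens the kernel extent without adding weights, and compares the size of a spatial attention map with a channel-wise one.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias omitted)."""
    return c_in * k * k + c_in * c_out


def dilated_kernel_extent(k, dilation):
    """Spatial extent covered by a single k x k kernel with the given dilation."""
    return dilation * (k - 1) + 1


def spatial_attention_entries(h, w):
    """Entries in one attention map when every pixel is a token: (HW) x (HW)."""
    return (h * w) ** 2


def channel_attention_entries(c):
    """Entries in one attention map when attention runs across channels: C x C."""
    return c * c


# Hypothetical sizes, chosen for illustration only (not from the paper).
c_in, c_out, k = 64, 64, 3
h, w = 48, 160

print("standard conv weights:      ", conv_params(c_in, c_out, k))
print("depthwise separable weights:", depthwise_separable_params(c_in, c_out, k))
for d in (1, 2, 3):
    print(f"3x3 kernel, dilation {d}: extent", dilated_kernel_extent(k, d))
print("spatial attention entries:  ", spatial_attention_entries(h, w))
print("channel attention entries:  ", channel_attention_entries(c_in))
```

With these assumed sizes, the depthwise separable layer needs roughly an eighth of the weights of the standard layer, and the channel-wise attention map is independent of image resolution, whereas the spatial map grows with the square of the pixel count.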

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/
