Engineering Science and Technology, an International Journal (Jun 2024)

Sparse Transformer-based bins and Polarized Cross Attention decoder for monocular depth estimation

  • Hai-Kun Wang,
  • Jiahui Du,
  • Ke Song,
  • Limin Cui

Journal volume & issue
Vol. 54
p. 101705

Abstract

Estimating depth from a single image is an important problem with applications across many areas of computer vision. Although some recent works obtain the depth map directly with complex, powerful networks, we aim instead to combine the encoder and decoder feature maps more effectively. To this end, we propose a novel U-Net-like network whose encoder is based on the Swin Transformer. For the decoder, we propose Polarized Cross Attention, which combines encoder and decoder features effectively by optimizing the initialization of the k and v vectors. To analyze the decoded output more globally, we further propose a Sparse Transformer post-processing module. This module uses the Kullback–Leibler divergence to obtain a sparse Q matrix and achieves O((hw)ln(hw)) time complexity and memory usage. Experiments on the KITTI and NYUv2 datasets show that the proposed method improves the accuracy of monocular depth estimation compared with state-of-the-art methods.
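
As a rough illustration of how a Kullback–Leibler criterion can yield a sparse Q matrix, the sketch below scores each query by the divergence between a uniform distribution and that query's attention distribution over the keys, then keeps only about ln(n) of the highest-scoring queries. The abstract does not give the authors' exact formulation, so this is an assumption: the measure follows the query sparsity measurement known from ProbSparse attention, and the function names and the constant `c` are hypothetical. For clarity the sketch scores every query against every key, so it does not by itself reach the O((hw)ln(hw)) cost stated above.

```python
import math

def kl_sparsity_scores(Q, K):
    """Per-query KL(uniform || softmax(q . k / sqrt(d))) over all keys.

    Q and K are lists of equal-length float vectors (hypothetical layout).
    A score near 0 means the query attends almost uniformly.
    """
    d = len(Q[0])
    n_keys = len(K)
    scores = []
    for q in Q:
        # Scaled dot-product logits of this query against every key.
        s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        # Numerically stable log-sum-exp of the logits.
        m = max(s)
        lse = m + math.log(sum(math.exp(x - m) for x in s))
        # KL(U || p) = logsumexp(s) - mean(s) - ln(n_keys).
        scores.append(lse - sum(s) / n_keys - math.log(n_keys))
    return scores

def select_active_queries(Q, K, c=1.0):
    """Indices of the u = ceil(c * ln(n)) queries with the largest score."""
    n = len(Q)
    u = max(1, min(n, math.ceil(c * math.log(n))))
    scores = kl_sparsity_scores(Q, K)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)[:u]
```

In a full attention layer, only the selected queries would receive ordinary softmax attention; how the remaining positions are filled (for example with the mean of the value vectors) is a design choice that the abstract does not specify.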

Keywords