IEEE Access (Jan 2024)
Eite-Mono: An Extremely Lightweight Architecture for Self-Supervised Monocular Depth Estimation
Abstract
In intelligent mine construction, depth prediction via machine vision plays a pivotal role in enhancing visual perception. This need, coupled with the scarcity of high-quality monocular depth estimation datasets, has driven the development of self-supervised approaches to depth prediction in complex scenarios. To balance model complexity against feature extraction capability, we introduce Eite-Mono, a lightweight Monocular Depth Estimation (MDE) framework. Eite-Mono adopts a dual-component architecture: PoseNet estimates camera motion between frames, while EiteDepth predicts dense depth using a U-shaped encoder-decoder that extracts multiscale features. Central to our approach is the Local-Global Feature Aggregation (LGFA) module, which efficiently captures both local and global image features by combining lightweight CNN and Vision Transformer structures to minimize model size and computational load. Evaluated on the KITTI and Make3D datasets, Eite-Mono achieves higher accuracy at lower complexity than existing self-supervised MDE models, surpassing Lite-Mono in accuracy with 60% fewer trainable parameters. We further assess the method in mining scenarios, where it shows consistent improvements over the compared approaches.
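The abstract does not include implementation details, so the following is a minimal PyTorch sketch of how a block that aggregates local (convolutional) and global (attention-based) features might be structured. The class name `LocalGlobalBlock` and all hyperparameters are illustrative assumptions, not the authors' actual LGFA implementation.

```python
# A minimal sketch of a local-global aggregation block in PyTorch.
# All module names and hyperparameters are illustrative assumptions,
# not the paper's actual LGFA implementation.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    """Hypothetical block: a depthwise-separable conv branch captures local
    features, a single-head self-attention branch captures global context."""

    def __init__(self, channels: int):
        super().__init__()
        # Local branch: depthwise 3x3 conv followed by a pointwise conv.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels),
            nn.Conv2d(channels, channels, 1),
        )
        # Global branch: lightweight single-head self-attention over
        # flattened spatial positions.
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        local = self.local(x)
        # Flatten to (B, H*W, C) for attention, then restore the spatial layout.
        tokens = self.norm(x.flatten(2).transpose(1, 2))
        global_feat, _ = self.attn(tokens, tokens, tokens)
        global_feat = global_feat.transpose(1, 2).reshape(b, c, h, w)
        # Aggregate local and global features with a residual connection.
        return x + local + global_feat


if __name__ == "__main__":
    block = LocalGlobalBlock(32)
    out = block(torch.randn(1, 32, 24, 80))  # e.g. a 1/8-resolution feature map
    print(out.shape)  # torch.Size([1, 32, 24, 80])
```

Fusing the two branches additively with a residual connection is one common lightweight choice; the paper's LGFA module may combine them differently.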
Keywords