Scientific Reports (Oct 2024)
Dense projection fusion for 3D object detection
Abstract
Fusing information from LiDAR and cameras can effectively enhance the perception of autonomous vehicles in diverse scenarios. Although point-wise fusion and Bird’s-Eye-View (BEV) fusion achieve relatively good results, they still cannot fully exploit the image information and lack effective depth information. In these fusion methods, the multi-modal features are first concatenated along the channel dimension, and the fused features are then extracted with convolutional layers. Although effective, this kind of fusion is too coarse: the fused features cannot focus on the regions that carry important information and suffer from severe noise. To tackle these issues, we propose a Dense Projection Fusion (DPFusion) approach. It consists of two new modules: a dense depth map guided BEV transform (DGBT) module and a multi-modal feature adaptive fusion (MFAF) module. The DGBT module first quickly estimates the depth of each pixel and then projects all image features into the BEV space, making full use of the image information. The MFAF module computes image and point cloud weights for each channel of each BEV grid cell and then adaptively weights and fuses the image BEV features with the point cloud BEV features. Notably, the MFAF module makes the fused features attend more to background and object outlines. Our proposed DPFusion achieves competitive 3D object detection results, with a mean Average Precision (mAP) of 70.4 and a nuScenes detection score (NDS) of 72.3 on the nuScenes validation set.
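As a rough illustration of the channel-wise adaptive weighting summarized above, the following minimal PyTorch-style sketch fuses image BEV features and point cloud BEV features by predicting a per-channel, per-grid-cell weight for each modality. This is not the authors' implementation of the MFAF module; the module name, 1x1-convolution weight predictors, sigmoid gating, and channel count are all assumptions made for illustration.

```python
# Sketch of channel-wise adaptive BEV fusion in the spirit of the MFAF module.
# NOT the authors' code; layer choices and sizes are assumptions.
import torch
import torch.nn as nn


class AdaptiveBEVFusion(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        # Predict one weight per channel per BEV grid cell for each modality.
        self.img_weight = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.pts_weight = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.out_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, img_bev: torch.Tensor, pts_bev: torch.Tensor) -> torch.Tensor:
        # img_bev, pts_bev: (B, C, H, W) features on the same BEV grid.
        joint = torch.cat([img_bev, pts_bev], dim=1)
        w_img = torch.sigmoid(self.img_weight(joint))  # image weights per channel/cell
        w_pts = torch.sigmoid(self.pts_weight(joint))  # point cloud weights per channel/cell
        fused = w_img * img_bev + w_pts * pts_bev      # adaptive weighted fusion
        return self.out_conv(fused)


if __name__ == "__main__":
    # Toy usage on a 128x128 BEV grid with 256 channels.
    img_bev = torch.randn(1, 256, 128, 128)
    pts_bev = torch.randn(1, 256, 128, 128)
    fused = AdaptiveBEVFusion(256)(img_bev, pts_bev)
    print(fused.shape)  # torch.Size([1, 256, 128, 128])
```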