IEEE Access (Jan 2024)
ContextNet: Leveraging Comprehensive Contextual Information for Enhanced 3D Object Detection
Abstract
Progress in object detection for autonomous driving using LiDAR point cloud data has been remarkable. However, current voxel-based two-stage detectors have not fully capitalized on the wealth of contextual information present in point cloud data. Typically, Voxel Feature Encoding (VFE) layers focus exclusively on internal voxel information, neglecting the broader context. Additionally, extracting 3D proposal features through Region of Interest (RoI) spatial quantization and pooling-based downsampling results in a loss of spatial detail within the proposed regions. This limited capture of contextual detail makes accurate object detection and localization challenging, particularly at long range. In this paper, we propose ContextNet, which leverages comprehensive contextual information for enhanced 3D object detection. Specifically, it comprises two modules: the Voxel Self-Attention Encoding module (VSAE) and the Joint Channel Self-Attention Re-weight module (JCSR). VSAE establishes dependencies between voxels through self-attention, expanding the receptive field and introducing substantial contextual information. JCSR employs joint attention to extract both local channel information and global context information from the raw point cloud within the RoI region. By integrating these two sets of information and re-weighting the point features, the 3D proposal is refined, enabling a more accurate estimation of the object's position and confidence. Extensive experiments on the KITTI and nuScenes datasets demonstrate that our approach outperforms existing voxel-based two-stage methods, with a 9.5% mAP improvement over the baseline on the nuScenes test set and a 1.61% improvement in hard AP over the baseline on the KITTI benchmark.
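To make the VSAE idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes pooled per-voxel features of a hypothetical dimension (feat_dim=64) and applies standard multi-head self-attention across non-empty voxels so that each voxel's encoding carries context beyond its own points; the class name and shapes are illustrative assumptions.

```python
# Illustrative sketch (assumed shapes and names, not the paper's exact code):
# self-attention across non-empty voxels injects scene-level context into
# each voxel's pooled feature vector.
import torch
import torch.nn as nn


class VoxelSelfAttention(nn.Module):
    """Hypothetical VSAE-style block: context exchange between voxel features."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (B, N_voxels, feat_dim) pooled per-voxel features
        ctx, _ = self.attn(voxel_feats, voxel_feats, voxel_feats)
        # Residual connection preserves the original local voxel information
        return self.norm(voxel_feats + ctx)


if __name__ == "__main__":
    feats = torch.randn(2, 1024, 64)        # 2 scenes, 1024 non-empty voxels each
    enhanced = VoxelSelfAttention()(feats)
    print(enhanced.shape)                   # torch.Size([2, 1024, 64])
```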
Keywords