IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
MFSA-Net: Semantic Segmentation With Camera-LiDAR Cross-Attention Fusion Based on Fast Neighbor Feature Aggregation
Abstract
Given the inherent limitations of camera-only and LiDAR-only methods for semantic segmentation in large-scale complex environments, multimodal information fusion for semantic segmentation has become a focal point of contemporary research. However, significant disparities between the modalities mean that existing fusion-based methods often suffer from low segmentation accuracy and limited efficiency in such environments. To address these challenges, we propose a semantic segmentation network with camera–LiDAR cross-attention fusion based on fast neighbor feature aggregation (MFSA-Net), which is better suited to large-scale semantic segmentation in complex environments. First, we propose a dual-distance attention feature aggregation module built on a rapid 3-D nearest neighbor search. This module employs a sliding-window method on point cloud perspective projections for fast proximity search and efficiently combines feature-distance and Euclidean-distance information to learn more distinctive local features, improving segmentation accuracy while preserving computational efficiency. Furthermore, we propose a two-stream network with residual-based cross-attention fusion, which integrates camera information into the LiDAR data stream more effectively, enhancing both accuracy and robustness. Extensive experiments on the large-scale point cloud datasets SemanticKITTI and nuScenes demonstrate that the proposed algorithm outperforms comparable methods for semantic segmentation in large-scale complex environments.
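Since the paper's implementation is not reproduced here, the following is a minimal PyTorch sketch of the dual-distance attention idea as described in the abstract: attention weights over the k neighbors of each point are derived jointly from Euclidean (spatial) offsets and feature-space differences. The tensor shapes, the gather_neighbors helper, and the MLP scoring head are illustrative assumptions, not the authors' code; neighbor indices are assumed to come from the sliding-window search on the perspective projection.

```python
# Hedged sketch of dual-distance attention aggregation (assumed shapes/design).
import torch
import torch.nn as nn


def gather_neighbors(x, idx):
    """Gather neighbor rows. x: (B, N, C), idx: (B, N, K) -> (B, N, K, C)."""
    B, N, C = x.shape
    K = idx.shape[-1]
    idx = idx.reshape(B, N * K, 1).expand(-1, -1, C)       # (B, N*K, C)
    return torch.gather(x, 1, idx).reshape(B, N, K, C)


class DualDistanceAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Scores spatial offsets (3 dims + 1 distance norm) and feature
        # differences jointly; this MLP design is an assumption.
        self.score = nn.Sequential(
            nn.Linear(channels + 4, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, feats, coords, idx):
        # feats: (B, N, C) point features; coords: (B, N, 3) xyz; idx: (B, N, K)
        nbr_f = gather_neighbors(feats, idx)                # (B, N, K, C)
        nbr_c = gather_neighbors(coords, idx)               # (B, N, K, 3)
        offset = nbr_c - coords.unsqueeze(2)                # spatial offsets
        eucl = offset.norm(dim=-1, keepdim=True)            # Euclidean distance
        feat_diff = nbr_f - feats.unsqueeze(2)              # feature-space distance
        attn = torch.softmax(
            self.score(torch.cat([offset, eucl, feat_diff], dim=-1)), dim=2
        )                                                   # weights over K neighbors
        return (attn * nbr_f).sum(dim=2)                    # aggregated (B, N, C)
```

In the same hedged spirit, a one-layer sketch of residual cross-attention fusion: LiDAR features act as queries over camera features, and the attended result is added back through a residual connection so the LiDAR stream remains primary when image features are uninformative. The use of nn.MultiheadAttention and a single fusion layer are illustrative choices, not the paper's exact architecture.

```python
# Hedged sketch of residual cross-attention fusion (assumed single-layer design).
import torch.nn as nn


class ResidualCrossAttentionFusion(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, lidar_f, camera_f):
        # lidar_f: (B, N, C) LiDAR tokens; camera_f: (B, M, C) camera tokens
        fused, _ = self.attn(query=lidar_f, key=camera_f, value=camera_f)
        return self.norm(lidar_f + fused)   # residual keeps LiDAR stream dominant
```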
Keywords