IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
MFSA-Net: Semantic Segmentation With Camera-LiDAR Cross-Attention Fusion Based on Fast Neighbor Feature Aggregation
Abstract
Given the inherent limitations of camera-only and LiDAR-only methods for semantic segmentation in large-scale complex environments, multimodal information fusion for semantic segmentation has become a focal point of contemporary research. However, significant disparities between the modalities mean that existing fusion-based methods often suffer from low segmentation accuracy and limited efficiency in such environments. To address these challenges, we propose a semantic segmentation network with camera–LiDAR cross-attention fusion based on fast neighbor feature aggregation (MFSA-Net), which is better suited to large-scale semantic segmentation in complex environments. First, we propose a dual-distance attention feature aggregation module built on a rapid 3-D nearest neighbor search. This module employs a sliding-window method on point cloud perspective projections for fast proximity search and efficiently combines feature-distance and Euclidean-distance information to learn more distinctive local features, improving segmentation accuracy while preserving computational efficiency. Furthermore, we propose a two-stream network with residual-based cross-attention fusion, which integrates camera information into the LiDAR data stream more effectively, enhancing both accuracy and robustness. Extensive experiments on the large-scale point cloud datasets SemanticKITTI and nuScenes demonstrate that the proposed algorithm outperforms comparable methods for semantic segmentation in large-scale complex environments.
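Since the paper's implementation is not reproduced here, the following is a minimal PyTorch sketch of the dual-distance attention idea as described in the abstract: attention weights over the k neighbors of each point are derived jointly from Euclidean (spatial) offsets and feature-space differences. The tensor shapes, the gather_neighbors helper, and the MLP scoring head are illustrative assumptions, not the authors' code; neighbor indices are assumed to come from the sliding-window search on the perspective projection.

```python
# Hedged sketch of dual-distance attention aggregation (assumed shapes/design).
import torch
import torch.nn as nn


def gather_neighbors(x, idx):
    """Gather neighbor rows. x: (B, N, C), idx: (B, N, K) -> (B, N, K, C)."""
    B, N, C = x.shape
    K = idx.shape[-1]
    idx = idx.reshape(B, N * K, 1).expand(-1, -1, C)       # (B, N*K, C)
    return torch.gather(x, 1, idx).reshape(B, N, K, C)


class DualDistanceAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Scores spatial offsets (3 dims + 1 distance norm) and feature
        # differences jointly; this MLP design is an assumption.
        self.score = nn.Sequential(
            nn.Linear(channels + 4, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, feats, coords, idx):
        # feats: (B, N, C) point features; coords: (B, N, 3) xyz; idx: (B, N, K)
        nbr_f = gather_neighbors(feats, idx)                # (B, N, K, C)
        nbr_c = gather_neighbors(coords, idx)               # (B, N, K, 3)
        offset = nbr_c - coords.unsqueeze(2)                # spatial offsets
        eucl = offset.norm(dim=-1, keepdim=True)            # Euclidean distance
        feat_diff = nbr_f - feats.unsqueeze(2)              # feature-space distance
        attn = torch.softmax(
            self.score(torch.cat([offset, eucl, feat_diff], dim=-1)), dim=2
        )                                                   # weights over K neighbors
        return (attn * nbr_f).sum(dim=2)                    # aggregated (B, N, C)
```

In the same hedged spirit, a one-layer sketch of residual cross-attention fusion: LiDAR features act as queries over camera features, and the attended result is added back through a residual connection so the LiDAR stream remains primary when image features are uninformative. The use of nn.MultiheadAttention and a single fusion layer are illustrative choices, not the paper's exact architecture.

```python
# Hedged sketch of residual cross-attention fusion (assumed single-layer design).
import torch.nn as nn


class ResidualCrossAttentionFusion(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, lidar_f, camera_f):
        # lidar_f: (B, N, C) LiDAR tokens; camera_f: (B, M, C) camera tokens
        fused, _ = self.attn(query=lidar_f, key=camera_f, value=camera_f)
        return self.norm(lidar_f + fused)   # residual keeps LiDAR stream dominant
```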
Keywords