Sensors (Nov 2022)

CMANet: Cross-Modality Attention Network for Indoor-Scene Semantic Segmentation

  • Longze Zhu,
  • Zhizhong Kang,
  • Mei Zhou,
  • Xi Yang,
  • Zhen Wang,
  • Zhen Cao,
  • Chenming Ye

DOI: https://doi.org/10.3390/s22218520
Journal volume & issue: Vol. 22, no. 21, p. 8520

Abstract

Indoor-scene semantic segmentation is of great significance to indoor navigation, high-precision map creation, route planning, etc. However, combining RGB and HHA images for indoor-scene semantic segmentation is a promising yet challenging task, due to the diversity of textures and structures and the disparity between the two modalities in physical meaning. In this paper, we propose a Cross-Modality Attention Network (CMANet) that facilitates the extraction of both RGB and HHA features and enhances cross-modality feature integration. CMANet is built on an encoder–decoder architecture. The encoder consists of two parallel branches that successively extract latent modality features from RGB and HHA images, respectively. In particular, a novel self-attention-based Cross-Modality Refine Gate (CMRG) is presented, which bridges the two branches. The CMRG performs cross-modality feature fusion and produces refined aggregated features; it is the most crucial part of CMANet. The decoder is a multi-stage up-sampling backbone composed of different residual blocks at each up-sampling stage. Furthermore, bi-directional multi-step propagation and pyramid supervision are applied to assist the learning process. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments are conducted on the NYUDv2 and SUN RGB-D datasets. Experimental results demonstrate that our method outperforms existing methods on indoor semantic-segmentation tasks.
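
The abstract describes the CMRG only at a high level (a self-attention-based gate that fuses the RGB and HHA branches), so the sketch below is a minimal, hypothetical PyTorch illustration of a cross-modality fusion gate, assuming a channel-attention style gating; the class name, reduction ratio, and gating form are illustrative and not the authors' actual implementation.

```python
import torch
import torch.nn as nn


class CrossModalityRefineGate(nn.Module):
    """Hypothetical sketch of a cross-modality fusion gate.

    Fuses same-resolution RGB and HHA feature maps using per-channel
    gates derived from the concatenated global context. The paper's
    CMRG is self-attention based; its internal details are assumed here.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Bottleneck MLP producing per-channel gates for both modalities.
        self.mlp = nn.Sequential(
            nn.Conv2d(2 * channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 2 * channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat: torch.Tensor, hha_feat: torch.Tensor) -> torch.Tensor:
        # Global context pooled from both modalities.
        context = self.pool(torch.cat([rgb_feat, hha_feat], dim=1))
        gates = self.mlp(context)
        g_rgb, g_hha = torch.chunk(gates, chunks=2, dim=1)
        # Re-weight each modality and aggregate into a fused feature map.
        return g_rgb * rgb_feat + g_hha * hha_feat


if __name__ == "__main__":
    gate = CrossModalityRefineGate(channels=64)
    rgb = torch.randn(2, 64, 60, 80)   # features from the RGB branch
    hha = torch.randn(2, 64, 60, 80)   # features from the HHA branch
    fused = gate(rgb, hha)
    print(fused.shape)  # torch.Size([2, 64, 60, 80])
```

In an encoder with two parallel branches, a gate like this would be applied at each stage where the branches are bridged, and the fused output passed on to the decoder or to deeper encoder stages.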

Keywords