IEEE Access (Jan 2023)

Complex Scene Segmentation With Local to Global Self-Attention Module and Feature Alignment Module

  • Xianfeng Ou,
  • Hanpu Wang,
  • Xinzhong Liu,
  • Jun Zheng,
  • Zhihao Liu,
  • Shulun Tan,
  • Hongzhi Zhou

DOI
https://doi.org/10.1109/ACCESS.2023.3311264
Journal volume & issue
Vol. 11
pp. 96530 – 96542

Abstract

It is challenging to accurately model the local and global context during complex scene segmentation. To solve this problem, this paper proposes a scene semantic segmentation network containing a local-to-global self-attention module and a feature alignment module. The local-to-global self-attention module is designed to combine local and global features, in which the transformer backbone treats all patches equally in the global scope to extract high-level features. The improved masked transformer with feature alignment module (MtFAM), which combines the masked transformer and the feature alignment module into a new decoder structure, is designed to fuse the features obtained from the vision transformer backbone and the local-to-global self-attention module. Experimental results demonstrate that the proposed structure shows better performance, improving mIoU by 3.63% on the ADE20K validation dataset compared to ViT-Tiny. In particular, it obtains a 2.23% higher mIoU than the Segmenter method using the same transformer backbone on this challenging scene segmentation benchmark.
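The abstract does not specify the internals of the local-to-global self-attention module, but the general idea of combining windowed (local) and full-sequence (global) self-attention can be sketched as follows. This is a minimal NumPy illustration under assumed simplifications (identity Q/K/V projections, a single head, and a hypothetical mixing weight `alpha`), not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def local_to_global_attention(x, window=4, alpha=0.5):
    """Illustrative local-to-global mixing over patch tokens.

    x: (num_patches, dim) patch embeddings.
    window: size of the local attention neighborhood (assumed).
    alpha: hypothetical weight blending local and global branches.
    """
    n, _ = x.shape
    # Global branch: every patch attends to all patches equally.
    global_out = self_attention(x, x, x)
    # Local branch: attention restricted to non-overlapping windows.
    local_out = np.zeros_like(x)
    for s in range(0, n, window):
        w = x[s:s + window]
        local_out[s:s + window] = self_attention(w, w, w)
    return alpha * local_out + (1 - alpha) * global_out

# Example: 8 patch tokens of dimension 16.
tokens = np.random.default_rng(0).normal(size=(8, 16))
fused = local_to_global_attention(tokens)
```

In a real network the local and global branches would use learned projections and multiple heads; the sketch only shows how the two attention scopes are computed and fused.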

Keywords