Remote Sensing (Oct 2022)
Pixel Representation Augmented through Cross-Attention for High-Resolution Remote Sensing Imagery Segmentation
Abstract
Segmentation methods developed for natural imagery have been transferred to land cover classification in remote sensing imagery with excellent performance. However, two key issues are overlooked in this transfer: (1) some objects are easily overwhelmed by complex backgrounds; (2) interclass information for indistinguishable classes is not fully utilized. The attention mechanism in the transformer can model long-range dependencies within each sample to extract per-pixel context, and this per-pixel context can aggregate category information. We therefore propose a semantic segmentation method based on pixel representation augmentation. In our method, a simplified feature pyramid decodes the hierarchical pixel features from the backbone; category representations are then decoded into learnable category object embedding queries by cross-attention in the transformer decoder. Finally, the pixel representation is augmented by an additional cross-attention in the transformer encoder under the supervision of auxiliary segmentation heads. Extensive experiments on the aerial image dataset Potsdam and the satellite image dataset Gaofen Image Dataset with 15 categories (GID-15) demonstrate that the cross-attention is effective: our method achieves a mean intersection over union (mIoU) of 86.2% and 62.5% on the Potsdam test set and the GID-15 validation set, respectively. Additionally, it reaches an inference speed of 76 frames per second (FPS) on the Potsdam test set, higher than all the state-of-the-art models we tested on the same device.
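The two cross-attention steps described above can be illustrated with a minimal PyTorch sketch: learnable category queries first attend to pixel features (transformer-decoder style) to obtain category representations, and the pixel features then attend to those category representations (transformer-encoder style) to augment each pixel. This is an illustrative assumption of how such a module could look, not the authors' implementation; the name PixelAugmenter, the dimensions, and the use of nn.MultiheadAttention are all choices made here for clarity.

```python
# Minimal sketch of cross-attention-based pixel representation augmentation.
# All names (PixelAugmenter, embed_dim, num_classes, etc.) are illustrative
# assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class PixelAugmenter(nn.Module):
    def __init__(self, embed_dim: int = 256, num_classes: int = 6, num_heads: int = 8):
        super().__init__()
        # Learnable category object embedding queries, one per land cover class.
        self.category_queries = nn.Parameter(torch.randn(num_classes, embed_dim))
        # Cross-attention 1: category queries attend to pixel features
        # (transformer-decoder style) to decode per-class representations.
        self.query_to_pixel = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Cross-attention 2: pixel features attend to the decoded category
        # representations (transformer-encoder style) to augment each pixel.
        self.pixel_to_category = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, pixel_feats: torch.Tensor):
        # pixel_feats: (B, C, H, W) features from the simplified feature pyramid.
        b, c, h, w = pixel_feats.shape
        pixels = pixel_feats.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        queries = self.category_queries.unsqueeze(0).expand(b, -1, -1)    # (B, K, C)

        # Decode category representations from per-pixel context.
        category_repr, _ = self.query_to_pixel(queries, pixels, pixels)   # (B, K, C)

        # Augment the pixel representation with category context.
        augmented, _ = self.pixel_to_category(pixels, category_repr, category_repr)
        augmented = (pixels + augmented).transpose(1, 2).reshape(b, c, h, w)
        return augmented, category_repr


# Usage example with random features standing in for backbone/feature-pyramid output.
feats = torch.randn(2, 256, 64, 64)
aug, cat = PixelAugmenter()(feats)
print(aug.shape, cat.shape)  # torch.Size([2, 256, 64, 64]) torch.Size([2, 6, 256])
```

In such a design, the augmented pixel features would feed the main segmentation head, while auxiliary heads on the intermediate outputs could provide the additional supervision mentioned in the abstract.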
Keywords