Building Type Classification Using CNN-Transformer Cross-Encoder Adaptive Learning From Very High Resolution Satellite Images

Shaofeng Zhang; Mengmeng Li; Wufan Zhao; Xiaoqin Wang; Qunyong Wu

doi:10.1109/JSTARS.2024.3501678

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2025)

Affiliations

Shaofeng Zhang: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
Mengmeng Li: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
Wufan Zhao: Urban Governance and Design Thrust, Society Hub, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Xiaoqin Wang: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China
Qunyong Wu: Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Academy of Digital China, Fuzhou University, Fuzhou, China

DOI: https://doi.org/10.1109/JSTARS.2024.3501678
Journal volume & issue: Vol. 18, pp. 976–994

Abstract

Building type information indicates the functional properties of buildings and plays a crucial role in smart city development and urban socioeconomic activities. Existing methods for classifying building types often struggle to distinguish between building types accurately while maintaining well-delineated boundaries, especially in complex urban environments. This study introduces a novel framework, the CNN-Transformer cross-attention feature fusion network (CTCFNet), for building type classification from very high resolution remote sensing images. CTCFNet integrates convolutional neural networks (CNNs) and Transformers using an interactive cross-encoder fusion module that enhances semantic feature learning and improves classification accuracy in complex scenarios. We develop an adaptive collaboration optimization module that applies human visual attention mechanisms to enhance the feature representation of building types and boundaries simultaneously. To address the scarcity of datasets for building type classification, we create two new datasets for model evaluation: the urban building type (UBT) dataset and the town building type (TBT) dataset. Extensive experiments on these datasets demonstrate that CTCFNet outperforms popular CNNs, Transformers, and dual-encoder methods in identifying building types across various regions, achieving the highest mean intersection over union of 78.20% and 77.11%, F1 scores of 86.83% and 88.22%, and overall accuracy of 95.07% and 95.73% on the UBT and TBT datasets, respectively. We conclude that CTCFNet effectively addresses the challenges of high interclass similarity and intraclass inconsistency in complex scenes, yielding results with well-delineated building boundaries and accurate building types.
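
For readers unfamiliar with the three evaluation metrics named in the abstract, the following is a minimal sketch of how mean intersection over union, F1 score, and overall accuracy are commonly computed from a per-pixel confusion matrix. It is not the authors' evaluation code. The function names `confusion_matrix` and `segmentation_metrics`, the toy labels, and the choice to average F1 over classes (macro averaging) are assumptions made for illustration; the paper may aggregate its scores differently.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Count samples: rows are reference classes, columns are predicted classes."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm):
    """Return (mean IoU, macro F1, overall accuracy) from a square confusion matrix.

    Assumption: classes that appear in neither the reference nor the
    prediction are skipped when averaging IoU and F1.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    correct = sum(cm[i][i] for i in range(n))
    ious, f1s = [], []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp                          # reference class i, predicted otherwise
        fp = sum(cm[r][i] for r in range(n)) - tp     # predicted class i, reference otherwise
        if tp + fp + fn == 0:
            continue                                  # class absent everywhere
        ious.append(tp / (tp + fp + fn))              # IoU = TP / (TP + FP + FN)
        f1s.append(2 * tp / (2 * tp + fp + fn))       # F1 = 2TP / (2TP + FP + FN)
    return sum(ious) / len(ious), sum(f1s) / len(f1s), correct / total


# Hypothetical toy labels for three building types (0, 1, 2), one entry per pixel.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
miou, f1, oa = segmentation_metrics(confusion_matrix(y_true, y_pred, 3))
print(f"mIoU={miou:.4f}  F1={f1:.4f}  OA={oa:.4f}")
```

Overall accuracy counts every pixel equally, so it tends to be dominated by the most frequent class, whereas mean IoU and macro F1 weight each class equally. This is one plausible reason the overall accuracy figures reported above are considerably higher than the mean IoU figures.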

Published in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing

ISSN: 1939-1404 (Print); 2151-1535 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Ocean engineering; Science: Physics: Geophysics. Cosmic physics
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=4609443
