Context-Aggregated and SAM-Guided Network for ViT-Based Instance Segmentation in Remote Sensing Images

Shuangzhou Liu; Feng Wang; Hongjian You; Niangang Jiao; Guangyao Zhou; Tingtao Zhang

doi:10.3390/rs16132472

Remote Sensing (Jul 2024)

Context-Aggregated and SAM-Guided Network for ViT-Based Instance Segmentation in Remote Sensing Images

Shuangzhou Liu,
Feng Wang,
Hongjian You,
Niangang Jiao,
Guangyao Zhou,
Tingtao Zhang

Affiliations

Shuangzhou Liu: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Feng Wang: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Hongjian You: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Niangang Jiao: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Guangyao Zhou: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
Tingtao Zhang: Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China

DOI: https://doi.org/10.3390/rs16132472
Journal volume & issue: Vol. 16, no. 13
p. 2472

Abstract

Read online

Instance segmentation of remote sensing images can not only provide object-level positioning information but also provide pixel-level positioning information. This pixel-level information annotation has a wide range of uses in the field of remote sensing, and it is of great value for environmental detection and resource management. Because optical images generally have complex terrain environments and changeable object shapes, SAR images are affected by complex scattering phenomena, and the mask quality obtained by the traditional instance segmentation method used in remote sensing images is not high. Therefore, it is a challenging task to improve the mask quality of instance segmentation in remote sensing images. Since the traditional two-stage instance segmentation method consists of backbone, neck, bbox head, and mask head, the final mask quality depends on the product of all front-end work quality. Therefore, we consider the difficulty of optical and SAR images to bring instance segmentation to the targeted improvement of the neck, bbox head, and mask head, and we propose the Context-Aggregated and SAM-Guided Network (CSNet). In this network, the plain feature fusion pyramid network (PFFPN) can generate a pyramid for the plain feature and provide a feature map of the appropriate instance scale for detection and segmentation. The network also includes a context aggregation bbox head (CABH), which uses the context information and instance information around the instance to solve the problem of missed detection and false detection in detection. The network also has a SAM-Guided mask head (SGMH), which learns by using SAM as a teacher, and uses the knowledge learned to improve the edge of the mask. Experimental results show that CSNet significantly improves the quality of masks generated under optical and SAR images, and CSNet achieves 5.1% and 3.2% AP increments compared with other SOTA models.

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/

About the journal

Abstract

Keywords