IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)
Reperceive Global Vision of Transformer for Remote Sensing Images Weakly Supervised Object Localization
Abstract
In recent decades, weakly supervised object localization (WSOL) has gained increasing attention in remote sensing. However, unlike natural images, remote sensing images (RSIs) often contain more complex scenes, which poses challenges for WSOL. Traditional convolutional neural network (CNN)-based WSOL methods are often limited by small receptive fields and yield unsatisfactory results. Transformer-based methods can capture global context, addressing the receptive-field limitations of CNN-based methods, yet they may also introduce attention diffusion. To address these problems, this article proposes RPGV, a novel WSOL method based on an interpretable vision transformer (ViT). We introduce a feature fusion enhancement module to obtain a saliency map that captures global information. Simultaneously, we address the problem of discrete attention in the traditional ViT and eliminate local distortion in the feature map by introducing a global semantic screening module. We conduct comprehensive experiments on the DIOR and HRRSD datasets, demonstrating the superior performance of our method compared with current state-of-the-art methods.
Keywords