Remote Sensing (May 2025)

RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model

  • Qiang Zhang,
  • Decheng Wang,
  • Xiao Yu

DOI
https://doi.org/10.3390/rs17101661
Journal volume & issue
Vol. 17, no. 10
p. 1661

Abstract


Fine-tuning optimization methods for foundation models have gradually become a research hotspot, driven by the development of generative pretrained transformers. However, compared to natural scene images, remote sensing images cover a wide range of spatial scales, contain complex objects, and offer limited labelled samples, which makes image interpretation challenging. To reduce the gap between natural scene images and remote sensing images, this paper proposes RLita, a novel optimization method for foundation models. Specifically, a region-level image–text alignment optimization method is proposed to represent image and text features as visual and semantic representation vectors in a single embedding space for better model generalization, and a parameter-efficient tuning strategy is designed to reduce the computational resources required. Experiments on five remote sensing datasets covering object detection, semantic segmentation, and change detection demonstrate the effectiveness of the RLita method.
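
The abstract does not give the alignment objective, so the following is only a generic sketch of what aligning region and text vectors in one embedding space can look like, not the RLita implementation. It assumes cosine similarity between L2-normalized vectors, a softmax over candidate texts, and a cross-entropy loss; the function names, the toy vectors, and the temperature of 0.07 are all illustrative assumptions.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def region_text_probs(region_vecs, text_vecs, temperature=0.07):
    """For each region vector, return a softmax distribution over text vectors.

    The temperature value is an assumption, not taken from the paper.
    """
    regions = [l2_normalize(r) for r in region_vecs]
    texts = [l2_normalize(t) for t in text_vecs]
    probs = []
    for r in regions:
        # Cosine similarity to every text vector, sharpened by the temperature.
        logits = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

def alignment_loss(probs, labels):
    """Mean cross-entropy against the index of the matching text for each region."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Toy data (made-up values): two region embeddings and two text embeddings.
region_vecs = [[1.0, 0.1, 0.0], [0.0, 0.2, 1.0]]
text_vecs = [[0.9, 0.0, 0.1], [0.1, 0.0, 0.9]]
labels = [0, 1]  # region i is assumed to match text i

probs = region_text_probs(region_vecs, text_vecs)
loss = alignment_loss(probs, labels)
predicted = [row.index(max(row)) for row in probs]
```

In this toy setup each region is assigned to the text whose vector points in nearly the same direction, and the loss shrinks as matching pairs move closer together in the shared space.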

Keywords