RLita: A Region-Level Image–Text Alignment Method for Remote Sensing Foundation Model

Qiang Zhang; Decheng Wang; Xiao Yu

doi:10.3390/rs17101661

Remote Sensing (May 2025)

Affiliations

Qiang Zhang: Beijing Institute of Tracking and Telecommunication Technology, Beijing 100094, China
Decheng Wang: Beijing Institute of Tracking and Telecommunication Technology, Beijing 100094, China
Xiao Yu: Beijing Institute of Tracking and Telecommunication Technology, Beijing 100094, China

DOI: https://doi.org/10.3390/rs17101661
Journal volume & issue: Vol. 17, no. 10, p. 1661

Abstract

Fine-tuning optimization methods for foundation models have gradually become a research hotspot owing to the development of generative pretrained transformers. However, compared with natural scene images, remote sensing images span a wide range of spatial scales, contain complex objects, and have limited labelled samples, which pose great challenges for image interpretation. To reduce the gap between natural scene images and remote sensing images, this paper proposes RLita, a novel optimization method for foundation models. Specifically, a region-level image–text alignment optimization method is proposed to represent image and text features as visual and semantic representation vectors in one embedding space for better model generalization, and a parameter-efficient tuning strategy is designed to reduce the computational resources required. Experiments on five remote sensing datasets covering object detection, semantic segmentation, and change detection show the effectiveness of the RLita method.
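
To illustrate what placing region features and text features in one embedding space can look like, the sketch below computes a symmetric contrastive loss between paired region and text vectors. The abstract does not specify RLita's loss, so this is only a generic stand-in under stated assumptions: the function name `region_text_alignment_loss`, the `temperature` value, and the convention that row i of each input forms a matched pair are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def region_text_alignment_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over paired region and text embeddings.

    Assumption (not from the paper): row i of region_feats and row i of
    text_feats describe the same region, and all other rows are negatives.
    """
    r = l2_normalize(np.asarray(region_feats, dtype=float))
    t = l2_normalize(np.asarray(text_feats, dtype=float))
    logits = r @ t.T / temperature          # (N, N) similarity matrix
    idx = np.arange(len(r))
    # Region-to-text: each region should pick out its own text (softmax over columns).
    loss_r2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-region: each text should pick out its own region (softmax over rows).
    loss_t2r = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_r2t + loss_t2r)


if __name__ == "__main__":
    feats = np.eye(4)
    print("matched pairs:   ", region_text_alignment_loss(feats, feats))
    print("mismatched pairs:", region_text_alignment_loss(feats, np.roll(feats, 1, axis=0)))
```

In this toy setting the loss is small when each region vector coincides with its paired text vector and large when the pairs are shuffled, which is the behaviour an alignment objective of this kind is meant to reward.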

Published in Remote Sensing

ISSN: 2072-4292 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science
Website: http://www.mdpi.com/journal/remotesensing/
