IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (Jan 2024)

Self-Supervised Spatio-Temporal Representation Learning of Satellite Image Time Series

  • Iris Dumeur,
  • Silvia Valero,
  • Jordi Inglada

DOI
https://doi.org/10.1109/JSTARS.2024.3358066
Journal volume & issue
Vol. 17
pp. 4350–4367

Abstract

In this article, a new self-supervised strategy for learning meaningful representations of complex optical satellite image time series (SITS) is presented. The proposed methodology, named Unet-BERT spAtio-temporal Representation eNcoder (U-BARN), exploits irregularly sampled SITS. The designed architecture learns rich and discriminative features from unlabeled data, enhancing the synergy between the spatio-spectral and temporal dimensions. To train on unlabeled data, a time-series reconstruction pretext task inspired by the BERT strategy but adapted to SITS is proposed. A large-scale unlabeled Sentinel-2 dataset is used to pretrain U-BARN. During pretraining, U-BARN processes annual time series composed of at most 100 dates. To demonstrate its feature-learning capability, representations of SITS encoded by U-BARN are then fed into a shallow classifier to generate semantic segmentation maps. Experiments are conducted on a labeled crop dataset (PASTIS) as well as a dense land cover dataset (MultiSenGE). Two ways of exploiting the U-BARN pretraining are considered: the U-BARN weights are either frozen or fine-tuned. The results demonstrate that SITS representations produced by the frozen U-BARN are more efficient for land cover and crop classification than those of a supervised-trained linear layer. Fine-tuning further boosts U-BARN's performance on the MultiSenGE dataset. In addition, on PASTIS, we observe that in scenarios with scarce reference data, fine-tuning brings a significant performance gain over fully supervised approaches. We also investigate the influence of the percentage of elements masked during pretraining on the quality of the SITS representation. Finally, semantic segmentation results show that the fully supervised U-BARN architecture outperforms the spatio-temporal baseline (U-TAE) on both downstream tasks: crop and dense land cover segmentation.
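The abstract describes a BERT-style pretext task: a fraction of the dates in each time series is masked, and the encoder is trained to reconstruct the masked observations. The following is a minimal sketch of that idea, not the authors' implementation; the helper names (`mask_dates`, `masked_mse`), the mask value, and the masking ratio are illustrative assumptions.

```python
import random

def mask_dates(series, mask_ratio=0.3, mask_value=0.0, seed=0):
    """Hypothetical helper sketching the BERT-style pretext task:
    randomly mask a fraction of the dates in a time series and return
    the masked series plus the indices to reconstruct.

    `series` is a list of per-date feature vectors (lists of floats);
    masking is done per index, so irregular sampling needs no special
    handling here.
    """
    rng = random.Random(seed)
    n = len(series)
    n_masked = max(1, int(round(mask_ratio * n)))
    masked_idx = set(rng.sample(range(n), n_masked))
    masked = [
        [mask_value] * len(v) if i in masked_idx else list(v)
        for i, v in enumerate(series)
    ]
    return masked, sorted(masked_idx)

def masked_mse(pred, target, masked_idx):
    """Reconstruction loss computed only on the masked dates, as in
    BERT-style training, so the model cannot score well by copying
    the visible inputs."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count
```

In this sketch, an encoder-decoder (U-BARN in the article) would consume the masked series and be penalized via `masked_mse` against the original values at the masked dates; varying `mask_ratio` corresponds to the masking-percentage study mentioned in the abstract.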

Keywords