International Journal of Applied Earth Observations and Geoinformation (Jul 2024)

Spatial-temporal distribution of labeled set bias remote sensing estimation: An implication for supervised machine learning in water quality monitoring

  • Yadong Zhou,
  • Wen Li,
  • Xiaoyu Cao,
  • Baoyin He,
  • Qi Feng,
  • Fan Yang,
  • Hui Liu,
  • Tiit Kutser,
  • Min Xu,
  • Fei Xiao,
  • Xueer Geng,
  • Kai Yu,
  • Yun Du

Journal volume & issue
Vol. 131
p. 103959

Abstract

Supervised machine learning (SML) has become a crucial tool for estimating water quality parameters (WQPs) from satellite images. Its effectiveness relies heavily on synchronised in situ datasets covering diverse water bodies. However, collecting such datasets is time-consuming, resulting in temporal gaps between sampling and imaging. In addition, the in situ dataset may be imbalanced. These imperfections can introduce uncertainties into SML-derived models, compromising the accuracy of the WQP estimates. Using in situ data collected automatically every four hours, the estimation of both optically active parameters (OAPs) and non-optically active parameters (nOAPs) in the Middle Reaches of the Yangtze River (MRYR) serves as an example to illustrate the importance of this challenge in freshwater remote sensing. The investigation was also extended to estimating OAPs and nOAPs in the lakes of Wuhan from manual sampling measurements, thereby bridging theoretical insights with real-world applications. Using four ML algorithms, the SML-based models for each WQP were calibrated on in situ datasets with different spatio-temporal distributions. The results showed that precision decreased with increasing time gaps, although most nOAPs (COD, TP, TN, pH, and DO) were more robust to the time gap than the OAPs (turbidity, Secchi depth, Chl-a, and algae density). The mean absolute percentage errors (MAPEs) of these nOAPs were as follows: for all models, pH MAPEs < 6% and DO MAPEs < 30%; for models using datasets with time gaps of 0 to ±4 days, MAPEs < 50% for COD, TP, and TN. All models for NH3-N estimation were invalid for both the MRYR water bodies and the real-world applications in Wuhan. Model accuracy decreased slightly when the sample size was sharply reduced by imposing the minimum gap constraint (within ±1 day). Furthermore, SML models based on imbalanced labels produced smooth estimates, resulting in under- or over-estimation in different waters. Thus, stricter matching is recommended for OAPs (0 to ±1 day) than for nOAPs (0 to ±3 days), while ensuring a sufficient sample size for learning. Solutions for long-tailed learning are suggested to address the imbalanced labels in further studies.
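
The two quantities the abstract relies on, the time gap between sampling and imaging and the MAPE, can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `match_within_gap` and `mape`, the nearest-overpass pairing rule, and the example dates and values are all assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def match_within_gap(samples, overpasses, max_gap_days):
    """Pair each in situ sample with its nearest satellite overpass,
    keeping the pair only if the time gap is within +/- max_gap_days.

    samples: list of (sampling_time, measured_value) tuples (hypothetical layout).
    overpasses: list of image acquisition times.
    """
    max_gap = timedelta(days=max_gap_days)
    pairs = []
    for sample_time, value in samples:
        nearest = min(overpasses, key=lambda t: abs(t - sample_time))
        if abs(nearest - sample_time) <= max_gap:
            pairs.append((sample_time, nearest, value))
    return pairs


def mape(observed, estimated):
    """Mean absolute percentage error in percent; observed values must be nonzero."""
    if not observed or len(observed) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((o - e) / o) for o, e in zip(observed, estimated))
    return 100.0 * total / len(observed)


# Made-up example data, not taken from the study.
samples = [
    (datetime(2024, 1, 1), 10.0),
    (datetime(2024, 1, 3), 20.0),
    (datetime(2024, 1, 10), 30.0),
]
overpasses = [datetime(2024, 1, 2), datetime(2024, 1, 8)]

strict = match_within_gap(samples, overpasses, max_gap_days=1)  # stricter window
loose = match_within_gap(samples, overpasses, max_gap_days=3)   # looser window
```

Tightening `max_gap_days` shrinks the matched set, which is the trade-off between matching strictness and sample size that the abstract describes.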

Keywords