Geoderma (May 2024)

A framework for optimizing environmental covariates to support model interpretability in digital soil mapping

  • Babak Kasraei,
  • Margaret G. Schmidt,
  • Jin Zhang,
  • Chuck E. Bulmer,
  • Deepa S. Filatow,
  • Adrienne Arbor,
  • Travis Pennell,
  • Brandon Heung

Journal volume & issue
Vol. 445
p. 116873

Abstract

A common practice in digital soil mapping (DSM) is to incorporate many environmental covariates into a machine-learning algorithm to predict the spatial patterns of soil attributes. Variance inflation factor (VIF), principal component analysis (PCA), and recursive feature elimination (RFE) are three statistical methods that can be used to reduce the number of covariates. This study aims 1) to compare the VIF and PCA approaches; 2) to identify an approach for determining the minimum number of covariates in DSM, ensuring model parsimony by applying RFE after VIF; and 3) to examine methods for interpreting the impact of covariates on the variability of the predicted soil properties. The study area was the province of British Columbia (BC), Canada. This study used legacy data for four soil properties to make digital soil maps: soil organic carbon (SOC%), pH, clay%, and coarse fragment (CF%). Seven models were built for each soil property to determine how the number of covariates, as selected by the various methods, influenced the validation results. The results showed that the number of covariates could be reduced from 70 to between 4 and 12 with little or no difference in concordance correlation coefficient (CCC) validation results. The CCC results of the pH models using 70 and 7 covariates were both 0.74, and for the other soil properties the difference was negligible. The validation results obtained from the PCA models showed that PCA was less effective than VIF at reducing the number of covariates. Moreover, this study showed that covariates related to precipitation were the most important for modeling SOC%, soil pH, and clay%, whereas topographic covariates were the most influential for modeling soil CF%. This study emphasizes the potential benefits of combining various data reduction methods to achieve optimal outcomes and generate the most parsimonious and interpretable models.
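
As a rough illustration of two quantities named in the abstract, the sketch below computes VIF-based covariate filtering and the concordance correlation coefficient (Lin's CCC) with NumPy. It is not the authors' implementation: the function names (`vif`, `drop_high_vif`, `ccc`) are hypothetical, the VIF threshold of 10 is a common rule of thumb rather than a value taken from the abstract, and the RFE step that the study applies after VIF is not shown.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_covariates).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns plus an intercept. This equals SS_total / SS_residual.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        out[j] = np.inf if ss_res == 0.0 else ss_tot / ss_res
    return out


def drop_high_vif(X, names, threshold=10.0):
    """Iteratively drop the covariate with the largest VIF until all are <= threshold.

    The threshold is an assumed rule-of-thumb value, not one from the source.
    """
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names


def ccc(observed, predicted):
    """Lin's concordance correlation coefficient (population variances)."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    cov = np.mean((o - o.mean()) * (p - p.mean()))
    return 2.0 * cov / (o.var() + p.var() + (o.mean() - p.mean()) ** 2)
```

In a workflow like the one the abstract describes, `drop_high_vif` would be applied to the covariate matrix before model fitting, and `ccc` would be evaluated on held-out observations against predictions. Unlike a plain correlation, CCC penalizes both scatter and systematic bias, so a model whose predictions are perfectly correlated with the observations but offset from them still scores below 1.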

Keywords