Geoderma (Nov 2023)
Optimal sampling using Conditioned Latin Hypercube for digital soil mapping: An approach using Bhattacharyya distance
Abstract
Soil properties are important because they determine a soil's suitability for different types of plant growth and for ecosystem and biota functioning. They also influence nutrient cycling, carbon sequestration and soil management. Digital Soil Mapping (DSM) is a procedure for mapping soil properties. Soil sampling for DSM is a foundational step in building prediction accuracy and is essential for capturing variability in the environmental covariates (ancillary variables). Conditioned Latin Hypercube (CLH) sampling is a method for generating a sample of points from a multivariate distribution that has been conditioned on one or more covariates. It is an extension of Latin Hypercube sampling, a popular technique for generating samples from a multivariate distribution in a way that ensures that each dimension is sampled uniformly. CLH sampling has the benefit of selecting sampling locations that cover the feature space, forming a hypercube of the original sample. However, determining the optimum sample size is crucial in soil survey exercises constrained by budget and time limits. For this purpose, a study was carried out on the Finzean Estate (44.8 km²) in Scotland. A dataset of 21 independent features (16 continuous and five categorical) at 17,932 sampling locations was created from Digital Elevation Model (DEM) derivatives, a soil class map, a land cover map, a peat depth map and a parent material map; sub-samples were then generated from this dataset and compared with the original population. Two hundred CLH sample datasets were extracted from the original population (17,932 data points) at 20 sizes (5, 10, 15, 20, …, 100), with 10 repetitions per size (e.g. 5_1, 5_2, …, 5_10). The sample datasets were analysed by comparing the mean, standard deviation, boxplots and estimates of the probability density function (pdf) for all 16 continuous independent features.
All of these comparisons suggest that the effect of increasing the sample size on the distribution of the covariates can be observed only up to a certain point, beyond which further increases may not yield noticeable differences. The Bhattacharyya distance, a statistical measure that quantifies the similarity between two probability distributions, was calculated between each quantitative and qualitative feature of every sample dataset and the same feature of the original population. As the CLH sample size increased, the Bhattacharyya distance decreased and then became constant. On this basis, a range of 25–50 CLH samples was suggested as the optimum sample size for the spatial extent of the Finzean Estate in Scotland. This work therefore achieved both a reduction in the number of sampling locations compared with classical approaches and identification of the precise locations of these samples to achieve optimal DSM.
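As an illustration of the comparison described above, the following is a minimal sketch of the Bhattacharyya distance between a sub-sample and its population for a single feature, using the standard discrete form D_B = -ln(sum_i sqrt(p_i q_i)). It is not the authors' implementation. The function names, the number of histogram bins, the synthetic population and the use of simple random sampling in place of CLH sampling are all assumptions made for this example.

```python
import numpy as np


def bhattacharyya_distance(p, q, eps=1e-12):
    """Bhattacharyya distance between two discrete distributions (counts or probabilities)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()  # normalise counts to probabilities
    q = q / q.sum()
    bc = min(1.0, float(np.sum(np.sqrt(p * q))))  # Bhattacharyya coefficient, in [0, 1]
    return -np.log(max(bc, eps))  # eps keeps the result finite when there is no overlap


def continuous_distance(population, sample, bins=20):
    """Compare a continuous feature using histograms on shared bin edges (bins=20 is an assumption)."""
    edges = np.histogram_bin_edges(population, bins=bins)
    p, _ = np.histogram(population, bins=edges)
    q, _ = np.histogram(sample, bins=edges)
    return bhattacharyya_distance(p, q)


def categorical_distance(population, sample):
    """Compare a categorical feature using class frequencies."""
    population = np.asarray(population)
    sample = np.asarray(sample)
    classes = np.unique(population)
    p = np.array([(population == c).sum() for c in classes])
    q = np.array([(sample == c).sum() for c in classes])
    return bhattacharyya_distance(p, q)


# Synthetic stand-in for one continuous covariate at 17,932 locations (not the study data).
rng = np.random.default_rng(0)
population = rng.normal(size=17932)

# Simple random sub-samples are used here only as a placeholder for CLH samples.
for n in (5, 25, 50, 100):
    sample = rng.choice(population, size=n, replace=False)
    print(n, continuous_distance(population, sample))
```

Plotting such distances against sample size, averaged over repetitions, would give the kind of curve from which a plateau, and hence a candidate sample size, can be read.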