Ecotoxicology and Environmental Safety (Jul 2023)
Feature fusion improves performance and interpretability of machine learning models in identifying soil pollution of potentially contaminated sites
Abstract
Owing to the rapid development of big data technology, the use of machine learning methods to identify soil pollution of potentially contaminated sites (PCS) at regional scales and in different industries has become a research hot spot. However, because key indexes of site pollution sources and pathways are difficult to obtain, current methods suffer from problems such as low prediction accuracy and an insufficient scientific basis. In this study, we collected the environmental data of 199 PCS in 6 typical industries involving heavy metal and organic pollution. Then, 21 indexes based on basic information, potential for pollution from product and raw material, pollution control level, and migration capacity of soil pollutants were used to establish the soil pollution identification index system. We fused the original indexes into a new feature subset of 11 indexes through consolidation calculation. The new feature subset was then used to train random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) models, which were tested to determine whether the subset improved the accuracy and precision of soil pollution identification models. Correlation analysis showed that the four new indexes created by feature fusion correlated with soil pollution to a degree similar to that of the original indexes. The accuracies and precisions of the three machine learning models trained on the new feature subset were 67.4%–72.9% and 72.0%–74.7%, which were 2.1%–2.5% and 0.3%–5.7% higher than those of the models trained on the original indexes, respectively. When the PCS were divided into typical heavy metal and organic pollution sites according to enterprise industry, the accuracies of the models trained on the two datasets for identifying soil heavy metal and organic pollution improved significantly, to approximately 80%.
Owing to the imbalance between positive and negative samples in the prediction of soil organic pollution, the precisions of the soil organic pollution identification models were 58%–72.5%, which were significantly lower than their accuracies. According to the factor analysis based on SHAP model interpretation, most of the indexes of basic information, potential for pollution from product and raw material, and pollution control level affected soil pollution to different degrees. However, the indexes of migration capacity of soil pollutants had the least effect in the classification task of identifying soil pollution of PCS. Among the indexes, traces of soil pollution, industrial utilization years/start-up time, pollution control risk scores, and enterprise scale had the greatest effects on soil pollution, with mean SHAP values of 0.17–0.36. These values reflect the contribution of each index to soil pollution and could help to optimize the current index scoring of the technical regulation for identifying site soil pollution. This study provides a new technical method for identifying soil pollution based on big data and machine learning, and offers a reference and scientific basis for environmental management and soil pollution control of PCS.
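The gap between precision and accuracy reported for the organic pollution models can be illustrated with a small worked example. The sketch below is not taken from the study: the confusion-matrix counts are hypothetical and were chosen only to show how a test set with few positive (polluted) samples can yield high accuracy alongside noticeably lower precision.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all sites that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp: int, fp: int) -> float:
    """Share of sites predicted as polluted that really are polluted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical counts for an imbalanced test set (assumption, not study data):
# 20 polluted sites (positives) and 80 unpolluted sites (negatives).
tp, fn = 12, 8   # polluted sites: correctly flagged / missed
fp, tn = 8, 72   # unpolluted sites: wrongly flagged / correctly cleared

acc = accuracy(tp, fp, tn, fn)
prec = precision(tp, fp)

print(f"accuracy  = {acc:.2f}")
print(f"precision = {prec:.2f}")
```

With these assumed counts, the many correctly cleared negative sites dominate the accuracy, whereas precision depends only on the sites flagged as polluted, so a handful of false positives lowers it considerably.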