Feature fusion improves performance and interpretability of machine learning models in identifying soil pollution of potentially contaminated sites

Xiaosong Lu; Junyang Du; Liping Zheng; Guoqing Wang; Xuzhi Li; Li Sun; Xinghua Huang

Ecotoxicology and Environmental Safety (Jul 2023)

Affiliations

Xiaosong Lu: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
Junyang Du: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
Liping Zheng: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
Guoqing Wang: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China; Correspondence to: Nanjing Institute of Environmental Science, Ministry of Ecology and Environment, #8 Jiangwangmiao Street, Nanjing 210042, China.
Xuzhi Li: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
Li Sun: State Environmental Protection Key Laboratory of Soil Environmental Management and Pollution Control, Nanjing Institute of Environmental Sciences, Ministry of Ecology and Environment, Nanjing 210042, China
Xinghua Huang: College of Environmental Science and Engineering, Yangzhou University, Yangzhou 225127, China

Journal volume & issue: Vol. 259, p. 115052

Abstract

Owing to the rapid development of big data technology, the use of machine learning methods to identify soil pollution of potentially contaminated sites (PCS) at regional scales and in different industries has become a research hot spot. However, because key indexes of site pollution sources and pathways are difficult to obtain, current methods suffer from low prediction accuracy and an insufficient scientific basis. In this study, we collected environmental data for 199 PCS in 6 typical industries involving heavy metal and organic pollution. Then, 21 indexes based on basic information, potential for pollution from products and raw materials, pollution control level, and migration capacity of soil pollutants were used to establish the soil pollution identification index system. We fused the original indexes into a new feature subset of 11 indexes through consolidation calculation. The new feature subset was then used to train random forest (RF), support vector machine (SVM), and multilayer perceptron (MLP) models, which were tested to determine whether it improved the accuracy and precision of soil pollution identification. Correlation analysis showed that the four new indexes created by feature fusion had correlations with soil pollution similar to those of the original indexes. The accuracies and precisions of the three machine learning models trained on the new feature subset were 67.4%–72.9% and 72.0%–74.7%, which were 2.1%–2.5% and 0.3%–5.7% higher, respectively, than those of the models trained on the original indexes. When the PCS were divided into typical heavy metal and organic pollution sites according to enterprise industry, the accuracies of the models trained on the two datasets for identifying soil heavy metal and organic pollution improved significantly, to approximately 80%.
Owing to the imbalance between positive and negative samples in the prediction of soil organic pollution, the precisions of the soil organic pollution identification models were 58%–72.5%, significantly lower than their accuracies. According to the factor analysis based on SHAP model interpretability, most of the indexes of basic information, potential for pollution from products and raw materials, and pollution control level affected soil pollution to varying degrees, whereas the indexes of migration capacity of soil pollutants had the least effect in the classification task of identifying soil pollution of PCS. Among the indexes, traces of soil pollution, industrial utilization years/start-up time, pollution control risk scores, and enterprise scale had the greatest effects on soil pollution, with mean SHAP values of 0.17–0.36; these values reflect their contribution to soil pollution and could help optimize the current index scoring in the technical regulation for identifying site soil pollution. This study provides a new technical method for identifying soil pollution based on big data and machine learning, and offers a reference and scientific basis for environmental management and soil pollution control of PCS.
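The gap between accuracy and precision under class imbalance can be illustrated with a small worked example. The sketch below is not taken from the study: the confusion counts are hypothetical, chosen only to show how a classifier can score a high accuracy while its precision stays much lower when polluted (positive) sites are the minority.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all sites that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def precision(tp, fp):
    """Share of sites predicted as polluted that are actually polluted."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Hypothetical confusion counts for 100 sites, of which only 20 are polluted.
# These numbers are illustrative assumptions, not results from the paper.
tp, fp, tn, fn = 12, 8, 72, 8

acc = accuracy(tp, fp, tn, fn)   # (12 + 72) / 100
prec = precision(tp, fp)         # 12 / (12 + 8)
print(f"accuracy={acc:.2f}, precision={prec:.2f}")
```

With these assumed counts, the many correctly classified unpolluted sites lift the accuracy, while precision depends only on the sites predicted as polluted and is therefore more sensitive to false positives.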

Published in Ecotoxicology and Environmental Safety

ISSN: 0147-6513 (Print); 1090-2414 (Online)
Publisher: Elsevier
Country of publisher: United States
LCC subjects: Technology: Environmental technology. Sanitary engineering: Environmental pollution; Geography. Anthropology. Recreation: Environmental sciences
Website: https://www.journals.elsevier.com/ecotoxicology-and-environmental-safety
