Machine learning coupled with causal inference to identify COVID-19 related chemicals that pose a high concern to drinking water
Min Han,
Jun Liang,
Biao Jin,
Ziwei Wang,
Wanlu Wu,
Hans Peter H. Arp
Affiliations
Min Han
State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China; CAS Center for Excellence in Deep Earth Science, Guangzhou 510640, China; University of Chinese Academy of Sciences, Beijing 10069, China; Guangdong Provincial Key Laboratory of Environmental Protection and Resources Utilization, Guangzhou 510640, China
Jun Liang
School of Software, South China Normal University, Foshan 528225, China
Biao Jin
State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China; CAS Center for Excellence in Deep Earth Science, Guangzhou 510640, China; University of Chinese Academy of Sciences, Beijing 10069, China; Guangdong Provincial Key Laboratory of Environmental Protection and Resources Utilization, Guangzhou 510640, China; Corresponding author
Ziwei Wang
State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China; CAS Center for Excellence in Deep Earth Science, Guangzhou 510640, China; University of Chinese Academy of Sciences, Beijing 10069, China
Wanlu Wu
State Key Laboratory of Organic Geochemistry, Guangzhou Institute of Geochemistry, Chinese Academy of Sciences, Guangzhou 510640, China; CAS Center for Excellence in Deep Earth Science, Guangzhou 510640, China; University of Chinese Academy of Sciences, Beijing 10069, China
Hans Peter H. Arp
Norwegian Geotechnical Institute (NGI), P.O. Box 3930 Ullevaal Stadion, N-0806 Oslo, Norway; Norwegian University of Science and Technology (NTNU), NO-7491 Trondheim, Norway
Summary: Various synthetic substances were used in large quantities during the COVID-19 pandemic, and some of these chemicals could enter drinking water sources. Persistent, mobile, and toxic (PMT) substances are a recognized threat to drinking water resources, but it has not yet been assessed how many COVID-19 related substances qualify as PMT substances. One reason is the lack of high-quality experimental data for identifying them. To address this gap, we applied a machine learning model to identify the PMT substances among COVID-19 related chemicals. The optimal model achieved an accuracy of 90.6% on external test data. Model interpretation and causal inference indicated that the model captured causal relationships between molecular descriptors and PMT properties, rather than correlations alone. Notably, the screening results showed that over 60% of the COVID-19 related chemicals considered are candidate PMT substances, which should be prioritized to prevent undue pollution of water resources.
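The two headline figures in the summary, accuracy on an external test set and the share of screened chemicals flagged as candidate PMT substances, are simple ratios. The sketch below shows how such figures could be computed. It is illustrative only: the function names and the label lists are hypothetical and are not the authors' data or code.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of test compounds whose predicted label matches the reference label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def candidate_fraction(y_pred: Sequence[int]) -> float:
    """Fraction of screened compounds predicted as PMT (label 1)."""
    if len(y_pred) == 0:
        raise ValueError("prediction sequence must be non-empty")
    return sum(1 for p in y_pred if p == 1) / len(y_pred)


# Hypothetical labels (1 = PMT, 0 = non-PMT); not taken from the study.
external_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
external_pred = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
screening_pred = [1, 1, 0, 1, 0]

print(f"external test accuracy: {accuracy(external_true, external_pred):.1%}")
print(f"candidate PMT share: {candidate_fraction(screening_pred):.1%}")
```

The external set matters here because its compounds are kept out of model training, so the accuracy estimates how well the model generalizes to chemicals it has not seen.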