A novel interpretable machine learning model approach for the prediction of TiO2 photocatalytic degradation of air contaminants

Rodrigo Teixeira Schossler; Samuel Ojo; Zhuoying Jiang; Jiajie Hu; Xiong Yu

doi:10.1038/s41598-024-62450-z

Scientific Reports (Jun 2024)

Affiliations

Rodrigo Teixeira Schossler: Department of Civil and Environmental Engineering, Case Western Reserve University
Samuel Ojo: Department of Civil and Environmental Engineering, Case Western Reserve University
Zhuoying Jiang: Department of Civil and Environmental Engineering, Case Western Reserve University
Jiajie Hu: Department of Civil and Environmental Engineering, Case Western Reserve University
Xiong Yu: Department of Civil and Environmental Engineering, Case Western Reserve University

DOI: https://doi.org/10.1038/s41598-024-62450-z
Journal volume & issue: Vol. 14, no. 1, pp. 1–15

Abstract

Air contaminants lead to various environmental and health issues. Titanium dioxide (TiO2) offers the benefit of autogenous photocatalytic degradation of air contaminants. Its performance is commonly evaluated with laboratory experiments that determine the kinetics of the photocatalytic degradation rate, which are labor-intensive, time-consuming, and costly. In this study, machine learning (ML) models were developed to predict the photo-degradation rate constants of airborne organic contaminants with TiO2 nanoparticles under ultraviolet irradiation. The ML models, whose hyperparameters were optimized, included an artificial neural network (ANN) with Bayesian optimization, a gradient boosting regressor (GBR) with Bayesian optimization, Extreme Gradient Boosting (XGBoost) optimized with Hyperopt, and CatBoost combined with AdaBoost. The organic contaminants were encoded as molecular fingerprints (MF). An imputation method was applied to handle missing data. A generative ML model, a vanilla GAN, was used to create synthetic data to augment the available dataset, and SHapley Additive exPlanations (SHAP) was employed for ML model interpretability. The results indicated that data imputation allowed full use of the limited dataset, leading to good prediction performance and preventing the overfitting problems common with small datasets. Additionally, augmenting the experimental data with synthetic data significantly improved prediction accuracy and considerably reduced overfitting. The results also ranked feature importance and assessed the impacts of different experimental variables on the photo-degradation rate; these impacts were consistent with physico-chemical laws.
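
The imputation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which imputation method the authors used, so column-mean imputation is assumed here purely for illustration. The function name `impute_column_means`, the feature columns, and the numeric values are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in the same column.

    Assumption: simple mean imputation; the paper's actual method may differ.
    """
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        col_means.append(mean(observed))
    # Build new rows so the input data is left unmodified.
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical experimental records: [UV intensity, humidity, catalyst dosage].
# None marks a value that was not reported for that experiment.
records = [
    [1.0, 40.0, None],
    [2.0, None, 0.5],
    [3.0, 60.0, 1.5],
]

filled = impute_column_means(records)
```

Filling the gaps this way keeps every experimental record available for training, which is the benefit the abstract attributes to imputation on a small dataset.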

Published in Scientific Reports

ISSN: 2045-2322 (Online)
Publisher: Nature Portfolio
Country of publisher: United Kingdom
LCC subjects: Medicine; Science
Website: https://www.nature.com/srep/