Atmosphere (Jun 2024)

Machine Learning Approach for the Estimation of Henry’s Law Constant Based on Molecular Descriptors

  • Atta Ullah,
  • Muhammad Shaheryar,
  • Ho-Jin Lim

DOI
https://doi.org/10.3390/atmos15060706
Journal volume & issue
Vol. 15, no. 6
p. 706

Abstract


In atmospheric chemistry, the Henry’s law constant (HLC) is crucial for understanding the distribution of organic compounds across gas, particle, and aqueous phases. Quantitative structure–property relationship (QSPR) models reported in the literature are generally tailored to specific groups or categories of substances and are often developed using a limited set of experimental data. This study developed a machine learning model using an extensive dataset of experimental HLCs for approximately 1100 organic compounds. Molecular descriptors calculated using alvaDesc software (v 2.0) were used to train the models. A hybrid approach was adopted for feature selection, ensuring alignment with domain knowledge. Based on the root mean squared error (RMSE) of the training and test data after cross-validation, Gradient Boosting (GB) was selected as the model for predicting HLC. The hyperparameters of the selected model were optimized using the automated hyperparameter optimization framework Optuna. The impact of features on the target variable was assessed using SHapley Additive exPlanations (SHAP). The optimized model demonstrated strong performance across the training, evaluation, and test datasets, achieving coefficients of determination (R²) of 0.96, 0.78, and 0.74, respectively. The developed model was used to estimate the HLC of compounds associated with carbon capture and storage (CCS) emissions and secondary organic aerosols.
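
The abstract reports two metrics, the RMSE used for model selection and the coefficient of determination used to report performance. As a minimal sketch of how these are conventionally computed, the plain-Python functions below implement the standard definitions. The function names and the four sample values are illustrative assumptions, not data or code from the study.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Synthetic values for illustration only (hypothetical, not from the paper).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

print(f"RMSE = {rmse(observed, predicted):.3f}")
print(f"R^2  = {r_squared(observed, predicted):.3f}")
```

Computing both metrics separately on the training, evaluation, and test splits, as the abstract does for the coefficient of determination, shows how far performance on unseen compounds falls below the fit to the training data.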

Keywords