IEEE Access (Jan 2024)

Prediction of Chemical Compounds Biodegradability: Molecular Fingerprint-Based Machine Learning Models

  • Alaa M. Elsayad,
  • Medien Zeghid,
  • Hassan Yousif Ahmed,
  • Khaled A. Elsayad

DOI
https://doi.org/10.1109/ACCESS.2024.3461164
Journal volume & issue
Vol. 12
pp. 135577 – 135588

Abstract

This work evaluates four machine learning models (MLMs), namely support vector machine (SVM), K-nearest neighbor (KNN), discriminant analysis (DA), and logistic regression (LR), for predicting the biodegradability of chemicals, a critical factor in assessing environmental risk. The RDKit library was first used to extract nine fingerprints from a dataset of 1717 chemical compounds. The Information Gain (IG) feature-ranking algorithm then identified the top 100 predictive features for each fingerprint. The MLM hyperparameters were optimized and informative features were selected using a multi-objective genetic algorithm (MOGA), which identified dominant MLM/fingerprint pairs that maximize the F1-score while minimizing the number of features. The MACCS, Layered, and Avalon molecular fingerprints, combined with the different MLMs, showed superior performance, with KNN/MACCS achieving a cross-validated F1-score of 87.63% using only 13 features. The final classification models were then constructed, for each MLM/fingerprint combination, from the solution with the best cross-validated F1-score (that is, the lowest F1-score complement), and evaluated on both the training and test datasets. In the training subset, the MACCS fingerprint consistently outperformed the others, achieving cross-validated AUC scores above 90% for all models (SVM: 91%, KNN: 90.4%, DA: 90.9%, LR: 91.32%). The SVM/MACCS pair achieved the highest accuracy (94.17%), specificity (95.84%), and F1-score (93.14%). In the test subset, the SVM/Layered pair achieved the highest accuracy (84.01%) and specificity (88.09%), the DA/Avalon pair the highest sensitivity (84.40%), and the SVM/Avalon pair the highest F1-score (82.58%). These results underscore the effectiveness of MACCS fingerprints for biodegradation classification across various models.
Additionally, Permutation Feature Importance (PFI) and Shapley Additive Explanations (SHAP) identified the key MACCS features influencing biodegradation classification in the SVM model. Both methods highlighted MACCS bits 154, 145, 142, 144, and 156 as the most important contributors to the SVM model’s predictions.
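
As a rough illustration of two steps named in the abstract, the sketch below computes the information gain of a single binary fingerprint bit and filters a set of already-evaluated (F1-score, feature count) candidates down to the non-dominated ones. It is not the authors' implementation: the function names and toy values are hypothetical, it uses only the Python standard library, and it shows only the dominance test, not the genetic search that the MOGA uses to generate candidates.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(bits, labels):
    """Entropy reduction in `labels` after splitting on one binary bit."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for b, y in zip(bits, labels) if b == value]
        if subset:
            gain -= (len(subset) / n) * entropy(subset)
    return gain


def pareto_front(candidates):
    """Keep (f1, n_features) pairs that no other candidate dominates.

    One candidate dominates another if its F1 is at least as high and its
    feature count at least as low, with a strict improvement in one of them.
    """
    front = []
    for f1, k in candidates:
        dominated = any(
            (f2 >= f1 and k2 <= k) and (f2 > f1 or k2 < k)
            for f2, k2 in candidates
        )
        if not dominated:
            front.append((f1, k))
    return front


# Toy data (hypothetical, not taken from the paper).
labels = [1, 1, 0, 0]           # 1 = biodegradable, 0 = not
informative_bit = [1, 1, 0, 0]  # perfectly separates the classes
useless_bit = [1, 0, 1, 0]      # carries no class information

print(information_gain(informative_bit, labels))
print(information_gain(useless_bit, labels))
print(pareto_front([(0.88, 13), (0.85, 20), (0.90, 40), (0.80, 13)]))
```

Ranking every bit of a fingerprint by `information_gain` and keeping the top 100 would mirror the pre-selection step described above. The dominance filter shows why a compact model with a slightly lower F1-score can sit alongside a larger, more accurate one among the reported solutions.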

Keywords