Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets

Thomas M. Kaiser; Pieter B. Burger

doi:10.3390/molecules24112115

Molecules (Jun 2019)

Affiliations

Thomas M. Kaiser: St Peter’s College, University of Oxford, New Inn Hall St, Oxford OX1 2DL, UK
Pieter B. Burger: Department of Drug Discovery and Biomedical Sciences, College of Pharmacy, Medical University of South Carolina, 280 Calhoun St. MSC 141, Charleston, SC 29425-1410, USA

DOI: https://doi.org/10.3390/molecules24112115
Journal volume & issue: Vol. 24, no. 11, p. 2115

Abstract

Machine learning continues to make significant advances in predicting properties relevant to drug development. However, the efficacy of machine learning in this area depends on highly accurate and abundant data. These two requirements, high accuracy and abundance, are often considered together; understanding how much dataset error contemporary machine learning algorithms can tolerate may indicate whether non-bench sources of data can be used to generate useful machine learning models where experimental data are scarce. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. Using these error-containing datasets, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored whether a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) could generate machine learning models with useful retrospective capabilities. The categorical error tolerance of the Naïve Bayes Network algorithm was quite high: on average, 39% error in the training set was required to lose predictivity on the test set. A Random Forest also tolerated a significant degree of categorical error in the training set, with an average of 29% error required to lose predictivity. The Probabilistic Neural Network algorithm tolerated less categorical error, losing predictivity at an average of 20% error. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods with a known error distribution, such as FEP+, may be useful for generating machine learning models without relying on extensive and expensive in vitro-generated datasets.
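
The error-introduction step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact protocol, so everything here is an assumption: labels are taken to be binary (1 = active, 0 = inactive), error is introduced by flipping a fixed fraction of training labels chosen at random, and the names `inject_label_error` and `observed_error` are hypothetical helpers, not functions from the paper.

```python
import random


def inject_label_error(labels, error_fraction, seed=0):
    """Return a copy of binary labels with a fixed fraction flipped.

    Assumption: categorical error means swapping active (1) and
    inactive (0) for a randomly chosen subset of the training set.
    """
    if not 0.0 <= error_fraction <= 1.0:
        raise ValueError("error_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_flip = round(error_fraction * len(labels))
    # Sample distinct positions so exactly n_flip labels are changed.
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for i in flip_idx:
        noisy[i] = 1 - noisy[i]
    return noisy


def observed_error(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


# Hypothetical training labels: 100 compounds, half active, half inactive.
clean_labels = [1, 0] * 50

# Sweep the error proportion, as one might before retraining a model
# at each level and recording test-set accuracy.
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    noisy_labels = inject_label_error(clean_labels, fraction, seed=42)
    print(fraction, observed_error(clean_labels, noisy_labels))
```

In a full experiment, each noisy label set would be used to retrain a classifier, and the accuracy on an uncorrupted test set would be tracked against the error fraction to find the point where predictivity is lost.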

Published in Molecules

ISSN: 1420-3049 (Online)
Publisher: MDPI AG
Country of publisher: Switzerland
LCC subjects: Science: Chemistry: Organic chemistry
Website: http://www.mdpi.com/journal/molecules
