Benchmarking Small-Dataset Structure-Activity-Relationship Models for Prediction of Wnt Signaling Inhibition

Mahtab Kokabi; Matthew Donnelly; Guangyu Xu

doi:10.1109/ACCESS.2020.3046190

IEEE Access (Jan 2020)

Affiliations

Mahtab Kokabi: Department of Electrical and Computer Engineering, University of Massachusetts at Amherst, Amherst, MA, USA
Matthew Donnelly: Department of Electrical and Computer Engineering, University of Massachusetts at Amherst, Amherst, MA, USA
Guangyu Xu: Department of Electrical and Computer Engineering, University of Massachusetts at Amherst, Amherst, MA, USA

DOI: https://doi.org/10.1109/ACCESS.2020.3046190
Journal volume & issue: Vol. 8
pp. 228831 – 228840

Abstract

Quantitative structure-activity relationship (QSAR) models based on machine learning algorithms are powerful tools to expedite drug discovery processes and therapeutics development. Given the cost of acquiring large training datasets, it is useful to examine whether QSAR analysis can reasonably predict drug activity with only a small dataset (size < 100) and to benchmark these small-dataset QSAR models in application-specific studies. To this end, we present a systematic benchmarking study of small-dataset QSAR models built to predict effective Wnt signaling inhibitors, which are essential to therapeutics development for prevalent human diseases (e.g., cancer). Specifically, we examined a total of 72 two-dimensional (2D) QSAR models based on 4 best-performing algorithms, 6 commonly used molecular fingerprints, and 3 typical fingerprint lengths. We trained these models using a training dataset (56 compounds), benchmarked their performance on 4 figures-of-merit (FOMs), and examined their prediction accuracy using an external validation dataset (14 compounds). Our data show that model performance is maximized when: 1) molecular fingerprints are selected to provide sufficient, unique, and not overly detailed representations of the chemical structures of drug compounds; 2) algorithms are selected to reduce the number of false predictions due to class imbalance in the dataset; and 3) models are selected to reach balanced performance on all 4 FOMs. These results may provide general guidelines for developing high-performance small-dataset QSAR models for drug activity prediction.
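The model count and dataset split quoted in the abstract can be checked with a short sketch. This is only an illustration of the arithmetic, not the authors' pipeline: the abstract does not name the algorithms, fingerprints, or fingerprint lengths, so every label below is a hypothetical placeholder.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract gives only the counts
# (4 algorithms, 6 fingerprints, 3 fingerprint lengths), not the names.
ALGORITHMS = [f"algorithm_{i}" for i in range(1, 5)]
FINGERPRINTS = [f"fingerprint_{i}" for i in range(1, 7)]
LENGTHS = [f"length_{i}" for i in range(1, 4)]

# Dataset sizes as stated in the abstract.
N_TRAIN = 56
N_VALIDATION = 14


def model_grid():
    """Enumerate every (algorithm, fingerprint, length) combination."""
    return list(product(ALGORITHMS, FINGERPRINTS, LENGTHS))


grid = model_grid()
n_total = N_TRAIN + N_VALIDATION
validation_fraction = N_VALIDATION / n_total

print(f"models in grid: {len(grid)}")
print(f"compounds in total: {n_total}")
print(f"external validation fraction: {validation_fraction:.2f}")
```

Under these counts the grid holds 4 × 6 × 3 = 72 configurations, and the 14 external validation compounds are one fifth of the 70 compounds overall, which is consistent with the "size < 100" small-dataset setting described above.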

Published in IEEE Access

ISSN: 2169-3536 (Online)
Publisher: IEEE
Country of publisher: United States
LCC subjects: Technology: Electrical engineering. Electronics. Nuclear engineering
Website: https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6287639
