Experimental Uncertainty in Training Data for Protein-Ligand Binding Affinity Prediction Models

Carlos A. Hernández-Garrido; Norberto Sánchez-Cruz

Artificial Intelligence in the Life Sciences (Dec 2023)


Affiliations

Carlos A. Hernández-Garrido: Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Carretera Mérida-Tetiz Km. 4.5, 97357, Ucú, Yucatán, Mexico
Norberto Sánchez-Cruz: Instituto de Química, Unidad Mérida, Universidad Nacional Autónoma de México, Carretera Mérida-Tetiz Km. 4.5, 97357, Ucú, Yucatán, Mexico; Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas Unidad Mérida, Universidad Nacional Autónoma de México, Sierra Papacál, 97302, Mérida, Yucatán, Mexico; Corresponding author.

Journal volume & issue: Vol. 4, p. 100087

Abstract


The accuracy of machine learning models for protein-ligand binding affinity prediction depends on the quality of the experimental data they are trained on. Most of these models are trained and tested on different subsets of the PDBbind database, the main public source of protein-ligand complexes with annotated binding affinities. However, estimating the experimental uncertainty of this database is not straightforward, because only a few protein-ligand complexes have more than one associated measurement. In this work, we analyze bioactivity data from ChEMBL to estimate the experimental uncertainty associated with the three binding affinity measures included in PDBbind (Ki, Kd, and IC50), as well as the effect of combining them. The experimental uncertainty of combining these three affinity measures was characterized by a mean absolute error of 0.78 logarithmic units, a root mean square error of 1.04 logarithmic units, and a Pearson correlation coefficient of 0.76. These estimates were compared with the performance of state-of-the-art machine learning models for binding affinity prediction, showing that these models tend to be overoptimistic when evaluated on the PDBbind core set.
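The three statistics quoted in the abstract (mean absolute error, root mean square error, and Pearson correlation coefficient) can be illustrated with a minimal sketch. It assumes that uncertainty is estimated from pairs of independent measurements of the same protein-ligand system, with affinities expressed as negative base-10 logarithms of molar values. The function name, the data layout, and the toy values are hypothetical and are not taken from the paper; the authors' actual pairing and filtering procedure may differ.

```python
import math
from itertools import combinations


def pairwise_uncertainty(measurements):
    """Estimate MAE, RMSE and Pearson r from repeated measurements.

    `measurements` maps a (target, ligand) key to a list of affinity
    values in logarithmic units (e.g. pKi). This layout is an assumption
    made for illustration only.
    """
    xs, ys = [], []
    for values in measurements.values():
        # Every unordered pair of measurements for the same system is
        # counted in both orders, so neither measurement is privileged.
        for a, b in combinations(values, 2):
            xs.extend([a, b])
            ys.extend([b, a])

    n = len(xs)
    if n == 0:
        raise ValueError("no system has more than one measurement")

    diffs = [x - y for x, y in zip(xs, ys)]
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson r is undefined for constant data")
    pearson_r = cov / math.sqrt(var_x * var_y)

    return mae, rmse, pearson_r


# Toy input (made-up values, not data from ChEMBL or PDBbind).
toy = {
    ("target_A", "ligand_1"): [6.0, 7.0],
    ("target_B", "ligand_2"): [8.0, 8.0],
    ("target_C", "ligand_3"): [5.5],  # single measurement, contributes no pairs
}
mae, rmse, pearson_r = pairwise_uncertainty(toy)
```

Under this reading, a prediction model whose error on a benchmark falls below the error between repeated experiments would be reporting performance better than the data can support, which is the sense in which the abstract calls such results overoptimistic.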

Published in Artificial Intelligence in the Life Sciences

ISSN: 2667-3185 (Online)
Publisher: Elsevier
Country of publisher: Netherlands
LCC subjects: Science: Science (General)
Website: https://www.journals.elsevier.com/artificial-intelligence-in-the-life-sciences
