BioMedInformatics (Mar 2024)

The Effect of Data Missingness on Machine Learning Predictions of Uncontrolled Diabetes Using All of Us Data

  • Zain Jabbar,
  • Peter Washington

DOI
https://doi.org/10.3390/biomedinformatics4010043
Journal volume & issue
Vol. 4, no. 1
pp. 780 – 795

Abstract

Read online

Electronic Health Records (EHR) provide a vast amount of patient data that are relevant to predicting clinical outcomes. The inherent presence of missing values poses challenges to building performant machine learning models. This paper aims to investigate the effect of various imputation methods on the National Institutes of Health’s All of Us dataset, a dataset containing a high degree of data missingness. We apply several imputation techniques such as mean substitution, constant filling, and multiple imputation on the same dataset for the task of diabetes prediction. We find that imputing values causes heteroskedastic performance for machine learning models with increased data missingness. That is, the more missing values a patient has for their tests, the higher variance there is on a diabetes model AUROC, F1, precision, recall, and accuracy scores. This highlights a critical challenge in using EHR data for predictive modeling. This work highlights the need for future research to develop methodologies to mitigate the effects of missing data and heteroskedasticity in EHR-based predictive models.

Keywords